* How to add pseudo vector types
From: Yuan Fu @ 2021-07-14 17:37 UTC
To: emacs-devel

Say I want to expose tree-sitter’s parser to lisp, and I define it as a new pseudo vector.

struct Lisp_TS_Parser
{
  union vectorlike_header header;
  Lisp_Object buffer;
  TSParser *parser;
  TSTree *tree;
  TSInput input;
};

Now if I want to return a Lisp_Object, do I initialize this struct and cast it into a Lisp_Object and return it? Like:

Lisp_TS_parser lisp_parser;
...
return (Lisp_Object)lisp_parser;

And how do I use a USER_PTR? Do I cast it into (struct Lisp_User_Ptr) and use it normally, or is there some helper function that I should use?

Are there examples of using pseudo vectors?

Thanks
Yuan
* Re: How to add pseudo vector types
From: Eli Zaretskii @ 2021-07-14 17:44 UTC
To: Yuan Fu; +Cc: emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Wed, 14 Jul 2021 13:37:47 -0400
>
> Say I want to expose tree-sitter’s parser to lisp, and I define it as a new pseudo vector.
>
> struct Lisp_TS_Parser
> {
>   union vectorlike_header header;
>   Lisp_Object buffer;
>   TSParser *parser;
>   TSTree *tree;
>   TSInput input;
> };

Inside Emacs, or in a module?  I assume the former.

> Now if I want to return a Lisp_Object, do I initialize this struct and cast it into a Lisp_Object and return it? Like:
>
> Lisp_TS_parser lisp_parser;
> ...
> return (Lisp_Object)lisp_parser;

No, you need to define a proper Lisp_Object, and then define
functions/macros to make a Lisp_Object that represents the struct, and
vice versa.

> And how do I use a USER_PTR? Do I cast it into (struct Lisp_User_Ptr) and use it normally, or is there some helper function that I should use?

Look in lisp.h, you will find some infrastructure there.

> Are there examples of using pseudo vectors?

Every buffer, window, frame, and overlay is a pseudo vector.  Look how
these are handled in lisp.h and in the rest of the code, and you will
find a lot of examples.
* Re: How to add pseudo vector types
From: Stefan Monnier @ 2021-07-14 17:47 UTC
To: Yuan Fu; +Cc: emacs-devel

Yuan Fu [2021-07-14 13:37:47] wrote:
> Say I want to expose tree-sitter’s parser to lisp, and I define it as a new pseudo vector.
>
> struct Lisp_TS_Parser
> {
>   union vectorlike_header header;
>   Lisp_Object buffer;
>   TSParser *parser;
>   TSTree *tree;
>   TSInput input;
> };
>
> Now if I want to return a Lisp_Object, do I initialize this struct and cast
> it into a Lisp_Object and return it? Like:
>
> Lisp_TS_parser lisp_parser;
> ...
> return (Lisp_Object)lisp_parser;
>
>
> And how do I use a USER_PTR? Do I cast it into (struct Lisp_User_Ptr) and
> use it normally, or is there some helper function that I should use?

Most likely you'll want some of your functions to take objects that
should be "tree sitter parsers", but you'll only receive a Lisp_Object
so you'll need to be able to *test* that the object you received is
indeed a "tree sitter parser".

For that reason you'll probably want to add a new entry to `pvec_type`
rather than use a USER_PTR.

> Are there examples of using pseudo vectors? Thanks

Lots of them: actual vectors, processes, threads, mutexes, overlays, you
name it.


        Stefan
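The two replies above describe one pattern: give the new type its own type code, provide a predicate that tests an arbitrary object for that code, and convert back to the struct only after the test. The block below is a minimal standalone sketch of that pattern, not Emacs's actual lisp.h code. Every name in it (`toy_pvec_type`, `toy_header`, `toy_object`, `toy_ts_parserp`, `toy_xts_parser`, `toy_make_object`) is a hypothetical stand-in, and a plain pointer to a header stands in for a tagged Lisp_Object.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a pseudovector type enum (hypothetical names).  */
enum toy_pvec_type { TOY_PVEC_BUFFER, TOY_PVEC_TS_PARSER, TOY_PVEC_TS_NODE };

/* Common header that every toy pseudovector starts with.  */
struct toy_header { enum toy_pvec_type type; };

/* A generic object reference: all we know is that it points at a header.  */
typedef struct toy_header *toy_object;

struct toy_ts_parser
{
  struct toy_header header;   /* must be the first member */
  const char *language_name;  /* placeholder payload */
};

/* Predicate: is OBJ a toy tree-sitter parser?  */
static int
toy_ts_parserp (toy_object obj)
{
  return obj != NULL && obj->type == TOY_PVEC_TS_PARSER;
}

/* Accessor: convert back to the struct, checking the type code first.  */
static struct toy_ts_parser *
toy_xts_parser (toy_object obj)
{
  assert (toy_ts_parserp (obj));
  return (struct toy_ts_parser *) obj;
}

/* Constructor direction: stamp the type code and expose the struct
   as a generic object.  */
static toy_object
toy_make_object (struct toy_ts_parser *p)
{
  p->header.type = TOY_PVEC_TS_PARSER;
  return &p->header;
}
```

In this toy model a function that receives a `toy_object` would call `toy_ts_parserp` before `toy_xts_parser`, which is the role a `pvec_type` entry plays and a bare USER_PTR does not.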
* Re: How to add pseudo vector types
From: Yuan Fu @ 2021-07-14 23:48 UTC
To: Stefan Monnier; +Cc: emacs-devel

>> And how do I use a USER_PTR? Do I cast it into (struct Lisp_User_Ptr) and
>> use it normally, or is there some helper function that I should use?
>
> Most likely you'll want some of your functions to take objects that
> should be "tree sitter parsers", but you'll only receive a Lisp_Object
> so you'll need to be able to *test* that the object you received is
> indeed a "tree sitter parser".
>
> For that reason you'll probably want to add a new entry to `pvec_type`
> rather than use a USER_PTR.

Actually, what is the correct way to provide a pointer from a dynamic module to Emacs core? I tried to use USER_PTR, but the dynamic module can only return an emacs_value, and to convert an emacs_value to a Lisp_Object, I need to use value_to_lisp, which is not exposed by emacs-module.c.

I want to provide individual tree-sitter language definitions from dynamic modules so that one doesn’t need to compile Emacs with language definitions.

Yuan
* Re: How to add pseudo vector types
From: Yuan Fu @ 2021-07-15 0:26 UTC
To: Stefan Monnier; +Cc: emacs-devel

> On Jul 14, 2021, at 7:48 PM, Yuan Fu <casouri@gmail.com> wrote:
>
>>> And how do I use a USER_PTR? Do I cast it into (struct Lisp_User_Ptr) and
>>> use it normally, or is there some helper function that I should use?
>>
>> Most likely you'll want some of your functions to take objects that
>> should be "tree sitter parsers", but you'll only receive a Lisp_Object
>> so you'll need to be able to *test* that the object you received is
>> indeed a "tree sitter parser".
>>
>> For that reason you'll probably want to add a new entry to `pvec_type`
>> rather than use a USER_PTR.
>
> Actually, what is the correct way to provide a pointer from a dynamic module to Emacs core? I tried to use USER_PTR, but the dynamic module can only return an emacs_value, and to convert an emacs_value to a Lisp_Object, I need to use value_to_lisp, which is not exposed by emacs-module.c.
>
> I want to provide individual tree-sitter language definitions from dynamic modules so that one doesn’t need to compile Emacs with language definitions.

I just realized that I can treat an emacs_value as a Lisp_Object. Is that right?

Yuan
* Re: How to add pseudo vector types
From: Yuan Fu @ 2021-07-15 2:48 UTC
To: Stefan Monnier; +Cc: emacs-devel

I defined two pseudo vectors for tree-sitter's parser and node and packaged a dynamic module for tree-sitter’s json language definition. I also wrapped a few tree-sitter functions just to test if everything works. Please have a look. I’m sure there are some problems because I mostly wrote it by copying, pasting and modifying other code I found in the Emacs source.

To try out this patch, get tree-sitter from https://github.com/tree-sitter/tree-sitter.git, then make and make install it. Then unzip json-module.zip to get the source of the json dynamic module. If my Makefile is correct, running make on it should produce a tree-sitter-json.so. Then if you apply ts.patch, compile Emacs, and run this snippet, you should get a string representation of the root node.

(require 'tree-sitter-json)
(tree-sitter-node-string
 (tree-sitter-parse "[1,2]" (tree-sitter-json)))

Yuan

[-- Attachment #2.2: ts.patch --]
[-- Type: application/octet-stream, Size: 15270 bytes --]

From 85baf92975224ea99b7f68d5854342803c61f1d1 Mon Sep 17 00:00:00 2001
From: Yuan Fu <casouri@gmail.com>
Date: Wed, 14 Jul 2021 22:26:42 -0400
Subject: [PATCH] checkpoint

---
 configure.ac      |  27 ++++++++-
 src/Makefile.in   |  11 +++-
 src/alloc.c       |  13 +++++
 src/emacs.c       |   4 ++
 src/lisp.h        |   2 +
 src/print.c       |  17 ++++++
 src/tree_sitter.c | 145 ++++++++++++++++++++++++++++++++++++++++++++++
 src/tree_sitter.h |  87 ++++++++++++++++++++++++++++
 8 files changed, 302 insertions(+), 4 deletions(-)
 create mode 100644 src/tree_sitter.c
 create mode 100644 src/tree_sitter.h

diff --git a/configure.ac b/configure.ac
index 830f33844b..42d2d43455 100644
--- a/configure.ac
+++ b/configure.ac
@@ -454,6 +454,7 @@ AC_DEFUN
 OPTION_DEFAULT_OFF([imagemagick],[compile with ImageMagick image support])
 OPTION_DEFAULT_ON([native-image-api], [don't use native image APIs (GDI+ on Windows)])
 OPTION_DEFAULT_IFAVAILABLE([json], [compile with native JSON support])
+OPTION_DEFAULT_IFAVAILABLE([tree-sitter], [compile with tree-sitter])
 
 OPTION_DEFAULT_ON([xft],[don't use XFT for anti aliased fonts])
 OPTION_DEFAULT_ON([harfbuzz],[don't use HarfBuzz for text shaping])
@@ -2963,6 +2964,23 @@ AC_DEFUN
 AC_SUBST(JSON_CFLAGS)
 AC_SUBST(JSON_OBJ)
 
+HAVE_TREE_SITTER=no
+TREE_SITTER_OBJ=
+
+if test "${with_tree_sitter}" != "no"; then
+  EMACS_CHECK_MODULES([TREE_SITTER], [tree-sitter >= 0.0],
+    [HAVE_TREE_SITTER=yes], [HAVE_TREE_SITTER=no])
+  if test "${HAVE_TREE_SITTER}" = yes; then
+    AC_DEFINE(HAVE_TREE_SITTER, 1, [Define if using tree-sitter.])
+    TREE_SITTER_LIBS=-ltree-sitter
+    TREE_SITTER_OBJ="tree_sitter.o"
+  fi
+fi
+
+AC_SUBST(TREE_SITTER_LIBS)
+AC_SUBST(TREE_SITTER_CFLAGS)
+AC_SUBST(TREE_SITTER_OBJ)
+
 NOTIFY_OBJ=
 NOTIFY_SUMMARY=no
 
@@ -4028,6 +4046,12 @@ AC_DEFUN
   *) MISSING="$MISSING json"
      WITH_IFAVAILABLE="$WITH_IFAVAILABLE --with-json=ifavailable";;
 esac
+case $with_tree_sitter,$HAVE_TREE_SITTER in
+  no,* | ifavailable,* | *,yes) ;;
+  *) MISSING="$MISSING tree-sitter"
+     WITH_IFAVAILABLE="$WITH_IFAVAILABLE --with-tree-sitter=ifavailable";;
+esac
+
 if test "X${MISSING}" != X; then
   # If we have a missing library, and we don't have pkg-config installed,
   # the missing pkg-config may be the reason.  Give the user a hint.
@@ -5833,7 +5857,7 @@ AC_DEFUN
 optsep=
 emacs_config_features=
 for opt in ACL CAIRO DBUS FREETYPE GCONF GIF GLIB GMP GNUTLS GPM GSETTINGS \
-  HARFBUZZ IMAGEMAGICK JPEG JSON LCMS2 LIBOTF LIBSELINUX LIBSYSTEMD LIBXML2 \
+  HARFBUZZ IMAGEMAGICK JPEG JSON TREE-SITTER LCMS2 LIBOTF LIBSELINUX LIBSYSTEMD LIBXML2 \
   M17N_FLT MODULES NATIVE_COMP NOTIFY NS OLDXMENU PDUMPER PNG RSVG SECCOMP \
   SOUND THREADS TIFF \
   TOOLKIT_SCROLL_BARS UNEXEC X11 XAW3D XDBE XFT XIM XPM XWIDGETS X_TOOLKIT \
@@ -5902,6 +5926,7 @@ AC_DEFUN
   Does Emacs use -lxft?                                   ${HAVE_XFT}
   Does Emacs use -lsystemd?                               ${HAVE_LIBSYSTEMD}
   Does Emacs use -ljansson?                               ${HAVE_JSON}
+  Does Emacs use -ltree-sitter?                           ${HAVE_TREE_SITTER}
   Does Emacs use the GMP library?                         ${HAVE_GMP}
   Does Emacs directly use zlib?                           ${HAVE_ZLIB}
   Does Emacs have dynamic modules support?                ${HAVE_MODULES}
diff --git a/src/Makefile.in b/src/Makefile.in
index 79cddb35b5..bfdfda566e 100644
--- a/src/Makefile.in
+++ b/src/Makefile.in
@@ -320,6 +320,10 @@ JSON_LIBS =
 JSON_CFLAGS = @JSON_CFLAGS@
 JSON_OBJ = @JSON_OBJ@
 
+TREE_SITTER_LIBS = @TREE_SITTER_LIBS@
+TREE_SITTER_CFLAGS = @TREE_SITTER_CFLAGS@
+TREE_SITTER_OBJ = @TREE_SITTER_OBJ@
+
 INTERVALS_H = dispextern.h intervals.h composite.h
 
 GETLOADAVG_LIBS = @GETLOADAVG_LIBS@
@@ -372,7 +376,7 @@ EMACS_CFLAGS=
   $(WEBKIT_CFLAGS) $(LCMS2_CFLAGS) \
   $(SETTINGS_CFLAGS) $(FREETYPE_CFLAGS) $(FONTCONFIG_CFLAGS) \
   $(HARFBUZZ_CFLAGS) $(LIBOTF_CFLAGS) $(M17N_FLT_CFLAGS) $(DEPFLAGS) \
-  $(LIBSYSTEMD_CFLAGS) $(JSON_CFLAGS) \
+  $(LIBSYSTEMD_CFLAGS) $(JSON_CFLAGS) $(TREE_SITTER_CFLAGS) \
   $(LIBGNUTLS_CFLAGS) $(NOTIFY_CFLAGS) $(CAIRO_CFLAGS) \
   $(WERROR_CFLAGS)
 ALL_CFLAGS = $(EMACS_CFLAGS) $(WARN_CFLAGS) $(CFLAGS)
@@ -406,7 +410,8 @@ base_obj =
 	thread.o systhread.o \
 	$(if $(HYBRID_MALLOC),sheap.o) \
 	$(MSDOS_OBJ) $(MSDOS_X_OBJ) $(NS_OBJ) $(CYGWIN_OBJ) $(FONT_OBJ) \
-	$(W32_OBJ) $(WINDOW_SYSTEM_OBJ) $(XGSELOBJ) $(JSON_OBJ)
+	$(W32_OBJ) $(WINDOW_SYSTEM_OBJ) $(XGSELOBJ) $(JSON_OBJ) \
+	$(TREE_SITTER_OBJ)
 obj = $(base_obj) $(NS_OBJC_OBJ)
 
 ## Object files used on some machine or other.
@@ -516,7 +521,7 @@ LIBES =
    $(FREETYPE_LIBS) $(FONTCONFIG_LIBS) $(HARFBUZZ_LIBS) $(LIBOTF_LIBS) $(M17N_FLT_LIBS) \
    $(LIBGNUTLS_LIBS) $(LIB_PTHREAD) $(GETADDRINFO_A_LIBS) $(LCMS2_LIBS) \
    $(NOTIFY_LIBS) $(LIB_MATH) $(LIBZ) $(LIBMODULES) $(LIBSYSTEMD_LIBS) \
-   $(JSON_LIBS) $(LIBGMP) $(LIBGCCJIT)
+   $(JSON_LIBS) $(LIBGMP) $(LIBGCCJIT) $(TREE_SITTER_LIBS)
 
 ## FORCE it so that admin/unidata can decide whether this file is
 ## up-to-date.  Although since charprop depends on bootstrap-emacs,
diff --git a/src/alloc.c b/src/alloc.c
index 76d8c7ddd1..f144e053f2 100644
--- a/src/alloc.c
+++ b/src/alloc.c
@@ -50,6 +50,10 @@ Copyright (C) 1985-1986, 1988, 1993-1995, 1997-2021 Free Software
 #include TERM_HEADER
 #endif /* HAVE_WINDOW_SYSTEM */
 
+#ifdef HAVE_TREE_SITTER
+#include "tree_sitter.h"
+#endif
+
 #include <flexmember.h>
 #include <verify.h>
 #include <execinfo.h>           /* For backtrace.  */
@@ -3144,6 +3148,15 @@ cleanup_vector (struct Lisp_Vector *vector)
       if (uptr->finalizer)
 	uptr->finalizer (uptr->p);
     }
+#ifdef HAVE_TREE_SITTER
+  else if (PSEUDOVECTOR_TYPEP (&vector->header, PVEC_TS_PARSER))
+    {
+      struct Lisp_TS_Parser *lisp_parser
+	= PSEUDOVEC_STRUCT (vector, Lisp_TS_Parser);
+      ts_tree_delete(lisp_parser->tree);
+      ts_parser_delete(lisp_parser->parser);
+    }
+#endif
 #ifdef HAVE_MODULES
   else if (PSEUDOVECTOR_TYPEP (&vector->header, PVEC_MODULE_FUNCTION))
     {
diff --git a/src/emacs.c b/src/emacs.c
index 60a57a693c..ede390231d 100644
--- a/src/emacs.c
+++ b/src/emacs.c
@@ -85,6 +85,7 @@ #define MAIN_PROGRAM
 #include "intervals.h"
 #include "character.h"
 #include "buffer.h"
+#include "tree_sitter.h"
 #include "window.h"
 #include "xwidget.h"
 #include "atimer.h"
@@ -2057,6 +2058,9 @@ main (int argc, char **argv)
       syms_of_floatfns ();
 
       syms_of_buffer ();
+      #ifdef HAVE_TREE_SITTER
+      syms_of_tree_sitter ();
+      #endif
       syms_of_bytecode ();
       syms_of_callint ();
       syms_of_casefiddle ();
diff --git a/src/lisp.h b/src/lisp.h
index 4fb8923678..e439447283 100644
--- a/src/lisp.h
+++ b/src/lisp.h
@@ -1070,6 +1070,8 @@ DEFINE_GDB_SYMBOL_END (PSEUDOVECTOR_FLAG)
   PVEC_CONDVAR,
   PVEC_MODULE_FUNCTION,
   PVEC_NATIVE_COMP_UNIT,
+  PVEC_TS_PARSER,
+  PVEC_TS_NODE,
 
   /* These should be last, for internal_equal and sxhash_obj.  */
   PVEC_COMPILED,
diff --git a/src/print.c b/src/print.c
index d4301fd7b6..e20a1d065a 100644
--- a/src/print.c
+++ b/src/print.c
@@ -48,6 +48,10 @@ Copyright (C) 1985-1986, 1988, 1993-1995, 1997-2021 Free Software
 # include <sys/socket.h> /* for F_DUPFD_CLOEXEC */
 #endif
 
+#ifdef HAVE_TREE_SITTER
+#include "tree_sitter.h"
+#endif
+
 struct terminal;
 
 /* Avoid actual stack overflow in print.  */
@@ -1853,6 +1857,19 @@ print_vectorlike (Lisp_Object obj, Lisp_Object printcharfun, bool escapeflag,
       }
       break;
 #endif
+
+#ifdef HAVE_TREE_SITTER
+    case PVEC_TS_PARSER:
+      print_c_string ("#<tree-sitter-parser in ", printcharfun);
+      print_string (BVAR (XTS_PARSER (obj)->buffer, name), printcharfun);
+      printchar ('>', printcharfun);
+      break;
+    case PVEC_TS_NODE:
+      print_c_string ("#<tree-sitter-node", printcharfun);
+      printchar ('>', printcharfun);
+      break;
+#endif
+
     default:
       emacs_abort ();
     }
diff --git a/src/tree_sitter.c b/src/tree_sitter.c
new file mode 100644
index 0000000000..f2134c571a
--- /dev/null
+++ b/src/tree_sitter.c
@@ -0,0 +1,145 @@
+/* Tree-sitter integration for GNU Emacs.
+
+Copyright (C) 2021 Free Software Foundation, Inc.
+
+This file is part of GNU Emacs.
+
+GNU Emacs is free software: you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation, either version 3 of the License, or (at
+your option) any later version.
+
+GNU Emacs is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with GNU Emacs.  If not, see <https://www.gnu.org/licenses/>.  */
+
+#include <config.h>
+
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/param.h>
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include "buffer.h"
+#include "coding.h"
+#include "tree_sitter.h"
+
+/* parser.h defines a macro ADVANCE that conflicts with alloc.c.  */
+#include <tree_sitter/parser.h>
+
+Lisp_Object
+make_ts_parser (struct buffer *buffer, TSParser *parser, TSTree *tree)
+{
+  struct Lisp_TS_Parser *lisp_parser
+    = ALLOCATE_PLAIN_PSEUDOVECTOR (struct Lisp_TS_Parser, PVEC_TS_PARSER);
+  lisp_parser->buffer = buffer;
+  lisp_parser->parser = parser;
+  lisp_parser->tree = tree;
+  // TODO TSInput.
+  return make_lisp_ptr (lisp_parser, Lisp_Vectorlike);
+}
+
+Lisp_Object
+make_ts_node (Lisp_Object parser, TSNode node)
+{
+  struct Lisp_TS_Node *lisp_node
+    = ALLOCATE_PSEUDOVECTOR (struct Lisp_TS_Node, parser, PVEC_TS_NODE);
+  lisp_node->parser = parser;
+  lisp_node->node = node;
+  return make_lisp_ptr (lisp_node, Lisp_Vectorlike);
+}
+
+
+/* Tree-sitter parser.  */
+
+DEFUN ("tree-sitter-parse", Ftree_sitter_parse, Stree_sitter_parse,
+       2, 2, 0,
+       doc: /* Parse STRING and return a parser object.
+LANGUAGE should be the language provided by a tree-sitter language
+dynamic module.  */)
+  (Lisp_Object string, Lisp_Object language)
+{
+  CHECK_STRING (string);
+
+  /* LANGUAGE is a USER_PTR that contains the pointer to a
+     TSLanguage struct.  */
+  TSParser *parser = ts_parser_new ();
+  TSLanguage *lang = (XUSER_PTR (language)->p);
+  ts_parser_set_language (parser, lang);
+
+  TSTree *tree = ts_parser_parse_string (parser, NULL,
+                                         SSDATA (string),
+                                         strlen (SSDATA (string)));
+
+  /* See comment for ts_parser_parse in tree_sitter/api.h
+     for possible reasons for a failure.  */
+  if (tree == NULL)
+    signal_error ("Failed to parse STRING", string);
+
+  TSNode root_node = ts_tree_root_node (tree);
+
+  Lisp_Object lisp_parser = make_ts_parser (NULL, parser, tree);
+  Lisp_Object lisp_node = make_ts_node (lisp_parser, root_node);
+
+  return lisp_node;
+}
+
+DEFUN ("tree-sitter-node-string",
+       Ftree_sitter_node_string, Stree_sitter_node_string, 1, 1, 0,
+       doc: /* Return the string representation of NODE.  */)
+  (Lisp_Object node)
+{
+  TSNode ts_node = XTS_NODE (node)->node;
+  char *string = ts_node_string(ts_node);
+  return make_string(string, strlen (string));
+}
+
+DEFUN ("tree-sitter-node-parent",
+       Ftree_sitter_node_parent, Stree_sitter_node_parent, 1, 1, 0,
+       doc: /* Return the immediate parent of NODE.
+Return nil if couldn't find any.  */)
+  (Lisp_Object node)
+{
+  TSNode ts_node = XTS_NODE (node)->node;
+  TSNode parent = ts_node_parent(ts_node);
+
+  if (ts_node_is_null(parent))
+    return Qnil;
+
+  return make_ts_node(XTS_NODE (node)->parser, parent);
+}
+
+DEFUN ("tree-sitter-node-child",
+       Ftree_sitter_node_child, Stree_sitter_node_child, 2, 2, 0,
+       doc: /* Return the Nth child of NODE.
+Return nil if couldn't find any.  */)
+  (Lisp_Object node, Lisp_Object n)
+{
+  CHECK_INTEGER (n);
+  EMACS_INT idx = XFIXNUM (n);
+  TSNode ts_node = XTS_NODE (node)->node;
+  // FIXME: Is this cast ok?
+  TSNode child = ts_node_child(ts_node, (uint32_t) idx);
+
+  if (ts_node_is_null(child))
+    return Qnil;
+
+  return make_ts_node(XTS_NODE (node)->parser, child);
+}
+
+/* Initialize the tree-sitter routines.  */
+void
+syms_of_tree_sitter (void)
+{
+  defsubr (&Stree_sitter_parse);
+  defsubr (&Stree_sitter_node_string);
+  defsubr (&Stree_sitter_node_parent);
+  defsubr (&Stree_sitter_node_child);
+}
diff --git a/src/tree_sitter.h b/src/tree_sitter.h
new file mode 100644
index 0000000000..3c9e03475f
--- /dev/null
+++ b/src/tree_sitter.h
@@ -0,0 +1,87 @@
+/* Header file for the tree-sitter integration.
+
+Copyright (C) 2021 Free Software Foundation, Inc.
+
+This file is part of GNU Emacs.
+
+GNU Emacs is free software: you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation, either version 3 of the License, or (at
+your option) any later version.
+
+GNU Emacs is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with GNU Emacs.  If not, see <https://www.gnu.org/licenses/>.  */
+
+#ifndef EMACS_TREE_SITTER_H
+#define EMACS_TREE_SITTER_H
+
+#include <sys/types.h>
+
+#include "lisp.h"
+
+#include <tree_sitter/api.h>
+
+INLINE_HEADER_BEGIN
+
+struct Lisp_TS_Parser
+{
+  union vectorlike_header header;
+  struct buffer *buffer;
+  TSParser *parser;
+  TSTree *tree;
+  TSInput input;
+};
+
+struct Lisp_TS_Node
+{
+  union vectorlike_header header;
+  /* This should prevent the gc from collecting the parser before the
+     node is done with it.  TSNode contains a pointer to the tree it
+     belongs to, and the parser object, when collected by gc, will
+     free that tree.  */
+  Lisp_Object parser;
+  TSNode node;
+};
+
+INLINE bool
+TS_PARSERP (Lisp_Object x)
+{
+  return PSEUDOVECTORP (x, PVEC_TS_PARSER);
+}
+
+INLINE struct Lisp_TS_Parser *
+XTS_PARSER (Lisp_Object a)
+{
+  eassert (TS_PARSERP (a));
+  return XUNTAG (a, Lisp_Vectorlike, struct Lisp_TS_Parser);
+}
+
+INLINE bool
+TS_NODEP (Lisp_Object x)
+{
+  return PSEUDOVECTORP (x, PVEC_TS_NODE);
+}
+
+INLINE struct Lisp_TS_Node *
+XTS_NODE (Lisp_Object a)
+{
+  eassert (TS_NODEP (a));
+  return XUNTAG (a, Lisp_Vectorlike, struct Lisp_TS_Node);
+}
+
+Lisp_Object
+make_ts_parser (struct buffer *buffer, TSParser *parser, TSTree *tree);
+
+Lisp_Object
+make_ts_node (Lisp_Object parser, TSNode node);
+
+extern void syms_of_tree_sitter (void);
+
+INLINE_HEADER_END
+
+#endif /* EMACS_TREE_SITTER_H */
-- 
2.24.3 (Apple Git-128)

[-- Attachment #2.4: json-module.zip --]
[-- Type: application/zip, Size: 8797 bytes --]
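On the `FIXME: Is this cast ok?` in `tree-sitter-node-child`: a plain cast to `uint32_t` silently wraps negative or oversized values, so a checked conversion is safer. The block below is a minimal sketch in standalone C. `checked_child_index` is a hypothetical helper, not part of the patch or of Emacs, and real code inside Emacs would more naturally rely on the integer range checks that lisp.h already offers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: store VALUE in *OUT and return true only if it
   fits in uint32_t; otherwise leave *OUT untouched and return false.  */
static bool
checked_child_index (intmax_t value, uint32_t *out)
{
  if (value < 0 || value > UINT32_MAX)
    return false;
  *out = (uint32_t) value;
  return true;
}
```

A caller would signal an args-out-of-range style error (or return nil) when the helper returns false, instead of passing a wrapped index on to `ts_node_child`.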
* Re: How to add pseudo vector types
From: Eli Zaretskii @ 2021-07-15 6:39 UTC
To: Yuan Fu; +Cc: monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Wed, 14 Jul 2021 22:48:30 -0400
> Cc: emacs-devel <emacs-devel@gnu.org>
>
> I defined two pseudo vectors for tree-sitter's parser and node and packaged a dynamic module for tree-sitter’s json language definition. I also wrapped a few tree-sitter functions just to test if everything works. Please have a look. I’m sure there are some problems because I mostly wrote it by copying, pasting and modifying other code I found in the Emacs source.

Thanks, but why does it parse only strings, not buffer text?
* Re: How to add pseudo vector types
From: Fu Yuan @ 2021-07-15 13:37 UTC
To: Eli Zaretskii; +Cc: monnier, emacs-devel

> On Jul 15, 2021, at 2:39 AM, Eli Zaretskii <eliz@gnu.org> wrote:
>
>> From: Yuan Fu <casouri@gmail.com>
>> Date: Wed, 14 Jul 2021 22:48:30 -0400
>> Cc: emacs-devel <emacs-devel@gnu.org>
>>
>> I defined two pseudo vectors for tree-sitter's parser and node and packaged a dynamic module for tree-sitter’s json language definition. I also wrapped a few tree-sitter functions just to test if everything works. Please have a look. I’m sure there are some problems because I mostly wrote it by copying, pasting and modifying other code I found in the Emacs source.
>
> Thanks, but why does it parse only strings, not buffer text?

I haven’t written it yet. I want to make sure the pseudo vector definition and configure files are right before going further. IIRC the contribution guide recommends sending small patches and updating along the way.

Yuan
* Re: How to add pseudo vector types
From: Eli Zaretskii @ 2021-07-15 14:18 UTC
To: Fu Yuan; +Cc: monnier, emacs-devel

> From: Fu Yuan <casouri@gmail.com>
> Date: Thu, 15 Jul 2021 09:37:27 -0400
> Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org
>
> > Thanks, but why does it parse only strings, not buffer text?
>
> I haven’t written it yet. I want to make sure the pseudo vector definition and configure files are right before going further. IIRC the contribution guide recommends sending small patches and updating along the way.

Great, then please try also to liberate the implementation from using
JSON, it's a major slowdown factor.
* Re: How to add pseudo vector types
From: Yuan Fu @ 2021-07-15 15:17 UTC
To: Eli Zaretskii; +Cc: Stefan Monnier, emacs-devel

> On Jul 15, 2021, at 10:18 AM, Eli Zaretskii <eliz@gnu.org> wrote:
>
>> From: Fu Yuan <casouri@gmail.com>
>> Date: Thu, 15 Jul 2021 09:37:27 -0400
>> Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org
>>
>>> Thanks, but why does it parse only strings, not buffer text?
>>
>> I haven’t written it yet. I want to make sure the pseudo vector definition and configure files are right before going further. IIRC the contribution guide recommends sending small patches and updating along the way.
>
> Great, then please try also to liberate the implementation from using
> JSON, it's a major slowdown factor.

JSON? I didn’t write anything involving JSON.

While you are looking at the patch, here are some questions about integrating tree-sitter with our buffer implementation. What I envisioned is for each buffer to have a `parser-list', and on buffer change, we update each parser’s tree. I think modifying signal_after_change is enough to cover all the cases?

And, for tree-sitter to take the buffer’s content directly, we need to tell it to skip the gap. I only need to modify gap_left, gap_right, make_gap_smaller and make_gap_larger, right?

Yuan
* Re: How to add pseudo vector types
From: Eli Zaretskii @ 2021-07-15 15:50 UTC
To: Yuan Fu; +Cc: monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Thu, 15 Jul 2021 11:17:02 -0400
> Cc: Stefan Monnier <monnier@iro.umontreal.ca>,
>  emacs-devel@gnu.org
>
> > Great, then please try also to liberate the implementation from using
> > JSON, it's a major slowdown factor.
>
> JSON? I didn’t write anything involving JSON.

Then what is json-module.zip about?

> While you are looking at the patch, here are some questions about integrating tree-sitter with our buffer implementation. What I envisioned is for each buffer to have a `parser-list', and on buffer change, we update each parser’s tree. I think modifying signal_after_change is enough to cover all the cases?

Why do you need to do this when a buffer is updated?  Why not use
display as the trigger?  Large portions of a buffer will never be
displayed, and some buffers will not be displayed at all.  Why waste
cycles on them?  Redisplay is perfectly equipped to tell you when some
chunk of buffer text is going to be redrawn, and it already knows to
do nothing if the buffer hasn't changed.

> And, for tree-sitter to take the buffer’s content directly, we need to tell it to skip the gap.

AFAIR, tree-sitter allows the calling package to provide a function to
access the text, isn't that so?  If so, you could write a function
that accesses buffer text via BYTE_POS_ADDR etc., and that knows how
to skip the gap already.

> I only need to modify gap_left, gap_right, make_gap_smaller and make_gap_larger, right?

Why would you need to _modify_ any of these?
* Re: How to add pseudo vector types
From: Yuan Fu @ 2021-07-15 16:19 UTC
To: Eli Zaretskii; +Cc: monnier, emacs-devel

> On Jul 15, 2021, at 11:50 AM, Eli Zaretskii <eliz@gnu.org> wrote:
>
>> From: Yuan Fu <casouri@gmail.com>
>> Date: Thu, 15 Jul 2021 11:17:02 -0400
>> Cc: Stefan Monnier <monnier@iro.umontreal.ca>,
>>  emacs-devel@gnu.org
>>
>>> Great, then please try also to liberate the implementation from using
>>> JSON, it's a major slowdown factor.
>>
>> JSON? I didn’t write anything involving JSON.
>
> Then what is json-module.zip about?

That’s a language definition for tree-sitter, so it tells tree-sitter how to parse a JSON file. There are definitions for Python, Ruby, C, etc. I just used JSON as an example. It’s named json-module because it is a dynamic module.

>> While you are looking at the patch, here are some questions about integrating tree-sitter with our buffer implementation. What I envisioned is for each buffer to have a `parser-list', and on buffer change, we update each parser’s tree. I think modifying signal_after_change is enough to cover all the cases?
>
> Why do you need to do this when a buffer is updated?  Why not use
> display as the trigger?  Large portions of a buffer will never be
> displayed, and some buffers will not be displayed at all.  Why waste
> cycles on them?  Redisplay is perfectly equipped to tell you when some
> chunk of buffer text is going to be redrawn, and it already knows to
> do nothing if the buffer hasn't changed.

Tree-sitter expects you to tell it every single change to the parsed text. Say you have a buffer with some content and scrolled through it, so tree-sitter has parsed the whole buffer. Then some elisp edited some text outside the visible portion. Redisplay doesn’t happen, so we don’t report this edit to tree-sitter. Then I scroll to the place that has been edited. What now? I’ve lost the change information, and tree-sitter’s tree is outdated.

We can fontify on-demand, but we can’t parse on-demand. What we can do is to only parse the portion from BOB to the visible portion. So we won’t parse the whole buffer unless you scroll to the bottom.

>> And, for tree-sitter to take the buffer’s content directly, we need to tell it to skip the gap.
>
> AFAIR, tree-sitter allows the calling package to provide a function to
> access the text, isn't that so?  If so, you could write a function
> that accesses buffer text via BYTE_POS_ADDR etc., and that knows how
> to skip the gap already.

Yes, that function returns a char*. But what if the gap is in the middle of the portion that tree-sitter wants to read? Alternatively, we can copy the text out and pass it to tree-sitter, but you don’t like that, IIRC.

>> I only need to modify gap_left, gap_right, make_gap_smaller and make_gap_larger, right?
>
> Why would you need to _modify_ any of these?

Because I want to let tree-sitter know where the gap is so it can avoid it when reading text.

Yuan
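What "telling tree-sitter every change" amounts to can be sketched in a few lines. Tree-sitter's `ts_tree_edit` takes a `TSInputEdit` whose byte fields are `start_byte`, `old_end_byte`, and `new_end_byte` (it also carries row/column points, omitted here). The block below is a standalone toy, not the real API: `toy_edit` and `toy_edit_from_change` are hypothetical names, and it assumes the change arrives as byte offsets in the (BEG, END, OLD-LEN) shape of an after-change hook. Emacs reports character positions, so real code would have to convert them to bytes first.

```c
#include <assert.h>
#include <stdint.h>

/* Toy counterpart of the byte-offset part of an edit record.  */
struct toy_edit
{
  uint32_t start_byte;    /* where the change begins */
  uint32_t old_end_byte;  /* end of the replaced text, before the change */
  uint32_t new_end_byte;  /* end of the replacement text, after the change */
};

/* BEG and END delimit the changed region as it is after the change;
   OLD_LEN is the length of the text that the region replaced.  */
static struct toy_edit
toy_edit_from_change (uint32_t beg, uint32_t end, uint32_t old_len)
{
  struct toy_edit edit = { beg, beg + old_len, end };
  return edit;
}
```

Under these assumptions, inserting three bytes at offset 10 gives start 10, old end 10, new end 13, and deleting five bytes at offset 20 gives start 20, old end 25, new end 20.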
* Re: How to add pseudo vector types
From: Yuan Fu @ 2021-07-15 16:26 UTC
To: Eli Zaretskii; +Cc: Stefan Monnier, emacs-devel

>>> And, for tree-sitter to take the buffer’s content directly, we need to tell it to skip the gap.
>>
>> AFAIR, tree-sitter allows the calling package to provide a function to
>> access the text, isn't that so?  If so, you could write a function
>> that accesses buffer text via BYTE_POS_ADDR etc., and that knows how
>> to skip the gap already.
>
> Yes, that function returns a char*. But what if the gap is in the middle of the portion that tree-sitter wants to read? Alternatively, we can copy the text out and pass it to tree-sitter, but you don’t like that, IIRC.

Or we can copy out only when the portion tree-sitter wants encompasses the gap. I expect this case to be relatively rare, so we won’t copy out all the time, and most of the time tree-sitter just reads from the buffer directly.

Yuan
* Re: How to add pseudo vector types 2021-07-15 16:26 ` Yuan Fu @ 2021-07-15 16:50 ` Eli Zaretskii 0 siblings, 0 replies; 370+ messages in thread From: Eli Zaretskii @ 2021-07-15 16:50 UTC (permalink / raw) To: Yuan Fu; +Cc: monnier, emacs-devel > From: Yuan Fu <casouri@gmail.com> > Date: Thu, 15 Jul 2021 12:26:25 -0400 > Cc: Stefan Monnier <monnier@iro.umontreal.ca>, > emacs-devel@gnu.org > > Or we can only copy out when the portion tree-sitter wants encompasses the gap, I expect this case to be > relatively rare so we won’t copy out all the time, and most of the time tree-sitter just reads from the buffer > directly. Actually, I expect this to happen quite frequently, because the gap is usually where the editing happens. We could, of course, move the gap out of the way temporarily, but that's somewhat expensive, so it is better to avoid it. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-15 16:19 ` Yuan Fu 2021-07-15 16:26 ` Yuan Fu @ 2021-07-15 16:48 ` Eli Zaretskii 2021-07-15 18:23 ` Yuan Fu 2021-07-20 16:25 ` Stephen Leake 1 sibling, 2 replies; 370+ messages in thread From: Eli Zaretskii @ 2021-07-15 16:48 UTC (permalink / raw) To: Yuan Fu; +Cc: monnier, emacs-devel > From: Yuan Fu <casouri@gmail.com> > Date: Thu, 15 Jul 2021 12:19:31 -0400 > Cc: monnier@iro.umontreal.ca, > emacs-devel@gnu.org > > > Why do you need to do this when a buffer is updated? why not use > > display as the trigger? Large portions of a buffer will never be > > displayed, and some buffers will not be displayed at all. Why waste > > cycles on them? Redisplay is perfectly equipped to tell you when some > > chunk of buffer text is going to be redrawn, and it already knows to > > do nothing if the buffer haven't changed. > > Tree-sitter expects you to tell it every single change to the parsed text. That cannot be true, because the parsed text could be in a state where parsing it will fail. When you are in the middle of writing the code, this is what will happen many times, even if you pass the whole buffer to the parser. And since tree-sitter _must_ be able to deal with this problem, it also must be able to receive incomplete parts of the buffer text, and do the best it can with it. > Say you have a buffer with some content and scrolled through it, so tree-sitter has parsed the whole buffer. Then some elisp edited some text outside the visible portion. Redisplay doesn’t happen, we don’t tell this edit to tree-sitter. Then I scroll to the place that has been edited. What now? Now you call tree-sitter passing it the part of the buffer that needs to be parsed (e.g., the chunk that is about to be displayed). If tree-sitter needs to look back, it will. > I’ve lost the change information, and tree-sitter’s tree is out-dated. No information is lost because the updated buffer text is available. 
> We can fontify on-demand, but we can’t parse on-demand. Sorry, I don't believe this is true. tree-sitter _must_ be able to deal with these situations, because it must be able to deal with incomplete text that cannot be parsed without parse errors. In addition, Emacs records (for redisplay purposes) two places in each buffer related to changes: the minimum buffer position before which no changes were done since last redisplay, and the maximum buffer position beyond which there were no changes. This can also be used to pass only a small part of the buffer to the parser, because the rest didn't change. > What we can do is to only parse the portion from BOB to the visible portion. So we won’t parse the whole buffer unless you scroll to the bottom. My primary worry is the fact that you want to use buffer-change hooks (and will soon enough want to use post-command-hook as well). They slow down editing, sometimes tremendously, so I'd very much prefer not to use those hooks for fontification/parsing. The original font-lock mechanism in Emacs 19 used these hooks; we switched to jit-lock and its redisplay-triggered fontifications because the original design had problems which couldn't be solved reliably and with reasonable performance. I hope we will not make the mistake of going back to that sub-optimal design. > >> And, for tree-sitter to take the buffer’s content directly, we need to tell it to skip the gap. > > > > AFAIR, tree-sitter allows the calling package to provide a function to > > access the text, isn't that so? If so, you could write a function > > that accesses buffer text via BYTE_POS_ADDR etc., and that knows how > > to skip the gap already. > > Yes, that function returns a char*. But what if the gap is in the middle of the portion that tree-sitter wants to read? If you provide the function that returns text one character at a time, as AFAIR tree-sitter allows, you will be able to skip the gap automagically by using BYTE_POS_ADDR. 
If that's not possible for some reason, or not performant enough, we could ask tree-sitter developers to add an API that accesses buffer text in two chunks, in which case it will be called first with text before the gap, and then with text after the gap. Like we do when we call regex search functions. > Alternatively, we can copy the text out and pass it to tree-sitter, but you don’t like that, IIRC. Yes, because it means memory allocation, which could be slow, especially for large buffers. It could even fail if the buffer is large enough and the system is under memory pressure. > >> I only need to modify gap_left, gap_right, make_gap_smaller and make_gap_larger, right? > > > > Why would you need to _modify_ any of these? > > Because I want to let tree-sitter to know where is the gap so it can avoid it when reading text. Knowing where the gap is doesn't need any changes to these functions. See GPT_BYTE, GAP_SIZE, BUF_GPT_BYTE, and BUF_GAP_SIZE. And the gap cannot move while tree-sitter accesses the buffer, because no other part of the Lisp machine can run at that time. ^ permalink raw reply [flat|nested] 370+ messages in thread
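Editor's sketch of the chunked-read idea discussed above. This is not Emacs code: `toy_gap_buffer`, `toy_read_chunk` and `toy_copy_text` are hypothetical names, and the buffer is a plain byte array with a hole in it. It only illustrates, under those assumptions, how a read callback that returns "a pointer plus a byte count" can hand out the text before the gap and the text after the gap as two contiguous chunks without copying anything.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy gap buffer (hypothetical, not Emacs's struct buffer).  The text
   lives in data[0, gap_start) and data[gap_end, size); the bytes in
   between are the gap and must never be handed to the parser.  */
struct toy_gap_buffer
{
  const char *data;
  size_t gap_start;
  size_t gap_end;
  size_t size;			/* Allocated bytes, gap included.  */
};

/* Return a pointer to the longest contiguous run of text starting at
   logical byte INDEX, and store its length in *LEN.  A run never
   crosses the gap, so the caller sees at most two chunks.  *LEN is 0
   at end of text.  */
static const char *
toy_read_chunk (const struct toy_gap_buffer *b, size_t index, size_t *len)
{
  size_t gap_len = b->gap_end - b->gap_start;
  size_t text_len = b->size - gap_len;

  if (index >= text_len)
    {
      *len = 0;
      return "";
    }
  if (index < b->gap_start)
    {
      /* Before the gap: stop right where the gap begins.  */
      *len = b->gap_start - index;
      return b->data + index;
    }
  /* At or after the gap: shift the address past the gap.  */
  *len = text_len - index;
  return b->data + index + gap_len;
}

/* Drain the buffer the way a parser would, by asking for chunk after
   chunk.  Returns the number of bytes copied into OUT.  */
static size_t
toy_copy_text (const struct toy_gap_buffer *b, char *out, size_t out_size)
{
  size_t index = 0;
  for (;;)
    {
      size_t len;
      const char *chunk = toy_read_chunk (b, index, &len);
      if (len == 0)
	break;
      assert (index + len < out_size);
      memcpy (out + index, chunk, len);
      index += len;
    }
  out[index] = '\0';
  return index;
}
```

The copy in `toy_copy_text` exists only to make the result easy to inspect; a parser would consume each chunk in place, which is the point of avoiding the allocation mentioned above.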
* Re: How to add pseudo vector types 2021-07-15 16:48 ` Eli Zaretskii @ 2021-07-15 18:23 ` Yuan Fu 2021-07-16 7:30 ` Eli Zaretskii 2021-07-20 16:27 ` Stephen Leake 2021-07-20 16:25 ` Stephen Leake 1 sibling, 2 replies; 370+ messages in thread From: Yuan Fu @ 2021-07-15 18:23 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Stefan Monnier, emacs-devel > On Jul 15, 2021, at 12:48 PM, Eli Zaretskii <eliz@gnu.org> wrote: > >> From: Yuan Fu <casouri@gmail.com> >> Date: Thu, 15 Jul 2021 12:19:31 -0400 >> Cc: monnier@iro.umontreal.ca, >> emacs-devel@gnu.org >> >>> Why do you need to do this when a buffer is updated? why not use >>> display as the trigger? Large portions of a buffer will never be >>> displayed, and some buffers will not be displayed at all. Why waste >>> cycles on them? Redisplay is perfectly equipped to tell you when some >>> chunk of buffer text is going to be redrawn, and it already knows to >>> do nothing if the buffer haven't changed. >> >> Tree-sitter expects you to tell it every single change to the parsed text. > > That cannot be true, because the parsed text could be in a state where > parsing it will fail. When you are in the middle of writing the code, > this is what will happen many times, even if you pass the whole buffer > to the parser. And since tree-sitter _must_ be able to deal with this > problem, it also must be able to receive incomplete parts of the > buffer text, and do the best it can with it. > >> Say you have a buffer with some content and scrolled through it, so tree-sitter has parsed the whole buffer. Then some elisp edited some text outside the visible portion. Redisplay doesn’t happen, we don’t tell this edit to tree-sitter. Then I scroll to the place that has been edited. What now? > > Now you call tree-sitter passing it the part of the buffer that needs > to be parsed (e.g., the chunk that is about to be displayed). If > tree-sitter needs to look back, it will. 
> >> I’ve lost the change information, and tree-sitter’s tree is out-dated. > > No information is lost because the updated buffer text is available. > >> We can fontify on-demand, but we can’t parse on-demand. > > Sorry, I don't believe this is true. tree-sitter _must_ be able to > deal with these situations, because it must be able to deal with > incomplete text that cannot be parsed without parse errors. > I think my assertion was too strong. By “can’t parse on-demand” I mean we can’t easily pass tree-sitter a random chunk of text without letting it parse from BOB. > In addition, Emacs records (for redisplay purposes) two places in each > buffer related to changes: the minimum buffer position before which no > changes were done since last redisplay, and the maximum buffer > position beyond which there were no changes. This can also be used to > pass only a small part of the buffer to the parser, because the rest > didn't change. > >> What we can do is to only parse the portion from BOB to the visible portion. So we won’t parse the whole buffer unless you scroll to the bottom. > > My primary worry is the fact that you want to use buffer-change hooks > (and will soon enough want to use post-command-hook as well). They > slow down editing, sometimes tremendously, so I'd very much prefer not > to use those hooks for fontification/parsing. The original font-lock > mechanism in Emacs 19 used these hooks; we switched to jit-lock and > its redisplay-triggered fontifications because the original design had > problems which couldn't be solved reliably and with reasonable > performance. I hope we will not make the mistake of going back to > that sub-optimal design. I understand. I want to point out that parsing is separate from fontification, and syntax-ppss flushes its cache in before-change-functions. I was hoping to use the parse tree for more than fontification, e.g., motion commands like forward-sexp/backward-sexp or structural editing commands like expand-region.
Another scenario: some elisp edits some text before the visible portion and the tree is not updated. Now I want to select the node at point (like expand-region), so I look for the leaf node that contains the byte position of point. However, because the tree is outdated, the byte position of point will not correspond to the node I want. We can still fontify with jit-lock; it’s just that parsing cannot easily work like fontification. I expect tree-sitter to work similarly to syntax-ppss rather than jit-lock. > >>>> And, for tree-sitter to take the buffer’s content directly, we need to tell it to skip the gap. >>> >>> AFAIR, tree-sitter allows the calling package to provide a function to >>> access the text, isn't that so? If so, you could write a function >>> that accesses buffer text via BYTE_POS_ADDR etc., and that knows how >>> to skip the gap already. >> >> Yes, that function returns a char*. But what if the gap is in the middle of the portion that tree-sitter wants to read? > > If you provide the function that returns text one character at a time, > as AFAIR tree-sitter allows, you will be able to skip the gap > automagically by using BYTE_POS_ADDR. If that's not possible for some > reason, or not performant enough, we could ask tree-sitter developers > to add an API that access buffer text in two chunks, in which case it > will be called first with text before the gap, and then with text > after the gap. Like we do when we call regex search functions. Yes, I made a mistake reading the API. Indeed we can read one character at a time, and the gap is not an issue anymore. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-15 18:23 ` Yuan Fu @ 2021-07-16 7:30 ` Eli Zaretskii 2021-07-16 14:27 ` Yuan Fu 2021-07-20 16:28 ` Stephen Leake 2021-07-20 16:27 ` Stephen Leake 1 sibling, 2 replies; 370+ messages in thread From: Eli Zaretskii @ 2021-07-16 7:30 UTC (permalink / raw) To: Yuan Fu; +Cc: monnier, emacs-devel > From: Yuan Fu <casouri@gmail.com> > Date: Thu, 15 Jul 2021 14:23:02 -0400 > Cc: Stefan Monnier <monnier@iro.umontreal.ca>, > emacs-devel@gnu.org > > >> Say you have a buffer with some content and scrolled through it, so tree-sitter has parsed the whole buffer. Then some elisp edited some text outside the visible portion. Redisplay doesn’t happen, we don’t tell this edit to tree-sitter. Then I scroll to the place that has been edited. What now? > > > > Now you call tree-sitter passing it the part of the buffer that needs > > to be parsed (e.g., the chunk that is about to be displayed). If > > tree-sitter needs to look back, it will. > > > >> I’ve lost the change information, and tree-sitter’s tree is out-dated. > > > > No information is lost because the updated buffer text is available. > > > >> We can fontify on-demand, but we can’t parse on-demand. > > > > Sorry, I don't believe this is true. tree-sitter _must_ be able to > > deal with these situations, because it must be able to deal with > > incomplete text that cannot be parsed without parse errors. > > > I think my assertion was too strong. By “can’t parse on-demand” I mean we can’t easily pass tree-sitter a random chunk of text and not letting it to parse from BOB. You must start from BOB only in languages that require that; not every language does. And even with languages that require starting from BOB, you could do that only once, the first time a buffer needs parsing; thereafter, you can only pass to tree-sitter the parts that were changed since the last time. Emacs records that information for the display engine, see BEG_UNCHANGED and END_UNCHANGED. 
If that is not enough, we could record more information about changes to buffer text. The main issue here is to pass the buffer text to tree-sitter lazily, only when and as much as needed. > I understand. I want to point out that parsing is separated from fontification, and syntax-pass flushes its cache in before-change-hook. I was hoping to use the parse tree for more than fontification, e.g., motion commands like sexp-forward/backward or structural editing commands like expand-region. Another scenario: some elisp edited some text before the visible portion, the tree is not updated, now I want to select the node at point (like expand-region), I look for the leave node that contains the byte position of point. However, because the tree is out-dated, the byte position of point will not correspond to the node I want. Each command/feature that needs an updated TS tree will take care of updating TS with the relevant information. We should record whatever we need for that as side effect of primitives that change buffer text (in insdel.c), and use the recorded info to update TS. But the actual passing of text to TS should happen lazily, when we actually need its re-parsing, not when the changes to buffer text are done. ^ permalink raw reply [flat|nested] 370+ messages in thread
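Editor's sketch of the "record cheaply, parse lazily" scheme proposed above. Everything here is hypothetical (`toy_change_log` and its functions are not Emacs or tree-sitter names). It assumes bookkeeping in the spirit of BEG_UNCHANGED and END_UNCHANGED: each edit only narrows two counters, and the expensive parse runs later, once, when some command actually asks for an up-to-date tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical change log kept next to a parser.  */
struct toy_change_log
{
  bool dirty;			/* Any edit since the last parse?  */
  size_t beg_unchanged;		/* Bytes at the start left untouched.  */
  size_t end_unchanged;		/* Bytes at the end left untouched.  */
  unsigned parse_count;		/* How many (re)parses really ran.  */
};

/* Call right after a parse: the whole text counts as unchanged.  */
static void
toy_log_reset (struct toy_change_log *log, size_t text_len)
{
  log->dirty = false;
  log->beg_unchanged = text_len;
  log->end_unchanged = text_len;
}

/* Cheap bookkeeping done by the editing primitive.  START is where
   the edit begins, NEW_END where the replacement text ends, and
   NEW_LEN the length of the text after the edit.  No parsing here.  */
static void
toy_record_change (struct toy_change_log *log, size_t start,
		   size_t new_end, size_t new_len)
{
  size_t tail = new_len - new_end;
  if (start < log->beg_unchanged)
    log->beg_unchanged = start;
  if (tail < log->end_unchanged)
    log->end_unchanged = tail;
  log->dirty = true;
}

/* Called by a command that needs a current tree.  Returns true if a
   parse was needed.  The parse itself is stubbed out as a counter.  */
static bool
toy_ensure_parsed (struct toy_change_log *log, size_t text_len)
{
  if (!log->dirty)
    return false;
  log->parse_count++;		/* Stand-in for the real re-parse.  */
  toy_log_reset (log, text_len);
  return true;
}
```

With this shape, a thousand edits between two requests for the tree cost a thousand comparisons and one parse, which seems to be the trade-off being asked for in the message above.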
* Re: How to add pseudo vector types 2021-07-16 7:30 ` Eli Zaretskii @ 2021-07-16 14:27 ` Yuan Fu 2021-07-16 14:33 ` Stefan Monnier 2021-07-16 15:27 ` Eli Zaretskii 2021-07-20 16:28 ` Stephen Leake 1 sibling, 2 replies; 370+ messages in thread From: Yuan Fu @ 2021-07-16 14:27 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Stefan Monnier, emacs-devel > > Each command/feature that needs an updated TS tree will take care of > updating TS with the relevant information. We should record whatever > we need for that as side effect of primitives that change buffer text > (in insdel.c), and use the recorded info to update TS. But the actual > passing of text to TS should happen lazily, when we actually need its > re-parsing, not when the changes to buffer text are done. Ok, I will write it like that. Another question, how do I add a new field in struct buffer? I tried to add Lisp_Object ts_parser_list_; Before Lisp_Object cursor_in_non_selected_windows_; But that wouldn't dump. I want to put the parsers in a field rather than in a buffer local variable because I don’t want users to add/remove parsers from this list freely, otherwise the parsers could go out of sync. I plan to provide functions like add-parser, remove-parser, buffer-parser-list for users to access this list. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-16 14:27 ` Yuan Fu @ 2021-07-16 14:33 ` Stefan Monnier 2021-07-16 14:53 ` Yuan Fu 2021-07-16 15:27 ` Eli Zaretskii 1 sibling, 1 reply; 370+ messages in thread From: Stefan Monnier @ 2021-07-16 14:33 UTC (permalink / raw) To: Yuan Fu; +Cc: Eli Zaretskii, emacs-devel > I want to put the parsers in a field rather than in a buffer local variable > because I don’t want users to add/remove parsers from this list freely, > otherwise the parsers could go out of sync. I wouldn't worry 'bout that: Emacs generally doesn't try to stop people shooting themselves in the foot. So we want to provide a convenient and safe API but we don't have to hide its inner workings. Stefan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-16 14:33 ` Stefan Monnier @ 2021-07-16 14:53 ` Yuan Fu 0 siblings, 0 replies; 370+ messages in thread From: Yuan Fu @ 2021-07-16 14:53 UTC (permalink / raw) To: Stefan Monnier; +Cc: Eli Zaretskii, emacs-devel > On Jul 16, 2021, at 10:33 AM, Stefan Monnier <monnier@iro.umontreal.ca> wrote: > >> I want to put the parsers in a field rather than in a buffer local variable >> because I don’t want users to add/remove parsers from this list freely, >> otherwise the parsers could go out of sync. > > I wouldn't worry 'bout that: Emacs generally doesn't try to stop people > shooting themselves in the foot. I should’ve figured that out by now ;-) > So we want to provide a convenient and > safe API but we don't have to hide its inner workings. > Ok, local variable then. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-16 14:27 ` Yuan Fu 2021-07-16 14:33 ` Stefan Monnier @ 2021-07-16 15:27 ` Eli Zaretskii 2021-07-16 15:51 ` Yuan Fu 1 sibling, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-07-16 15:27 UTC (permalink / raw) To: Yuan Fu; +Cc: monnier, emacs-devel > From: Yuan Fu <casouri@gmail.com> > Date: Fri, 16 Jul 2021 10:27:36 -0400 > Cc: Stefan Monnier <monnier@iro.umontreal.ca>, > emacs-devel@gnu.org > > Another question, how do I add a new field in struct buffer? I tried to add > > Lisp_Object ts_parser_list_; > > Before > > Lisp_Object cursor_in_non_selected_windows_; > > But that wouldn't dump. Did you see in init_buffer_once what we do with built-in fields of struct buffer? ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-16 15:27 ` Eli Zaretskii @ 2021-07-16 15:51 ` Yuan Fu 2021-07-17 2:05 ` Yuan Fu 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-07-16 15:51 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Stefan Monnier, emacs-devel > On Jul 16, 2021, at 11:27 AM, Eli Zaretskii <eliz@gnu.org> wrote: > >> From: Yuan Fu <casouri@gmail.com> >> Date: Fri, 16 Jul 2021 10:27:36 -0400 >> Cc: Stefan Monnier <monnier@iro.umontreal.ca>, >> emacs-devel@gnu.org >> >> Another question, how do I add a new field in struct buffer? I tried to add >> >> Lisp_Object ts_parser_list_; >> >> Before >> >> Lisp_Object cursor_in_non_selected_windows_; >> >> But that wouldn't dump. > > Did you see in init_buffer_once what we do with built-in fields of > struct buffer? I did not; that must be why. Thanks. I’ve since switched to a buffer-local variable, as Stefan suggested. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-16 15:51 ` Yuan Fu @ 2021-07-17 2:05 ` Yuan Fu 2021-07-17 2:23 ` Clément Pit-Claudel 2021-07-17 6:56 ` Eli Zaretskii 0 siblings, 2 replies; 370+ messages in thread From: Yuan Fu @ 2021-07-17 2:05 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Stefan Monnier, emacs-devel [-- Attachment #1: Type: text/plain, Size: 2008 bytes --] Please have a look at the second patch that applies on top of the first one. This time I added after-change hooks, so if you create a parser for a buffer and edit that buffer, the parser is kept updated lazily. In summary, the parser parses the whole buffer the first time the user asks for the parse tree. In after-change-hook, no parsing is done, but we do update the trees with position changes. The next time the user asks for the parse tree, the whole buffer is re-parsed incrementally. (I didn’t read the paper, but I assume it knows where the bits to re-parse are because we updated the tree with position changes.) Maybe this is not lazy enough, and I should do a benchmark. This is a simple benchmark that I did: Benchmark 1: 22M json file, opened literally, tried to parse the whole buffer; it took 17s and used 3G of memory. Benchmark 2: 1.6M json file, opened in fundamental mode, first parsed the whole buffer, which took 1.039s with no gc. Then ran this: (benchmark-run 1000 (dotimes (_ 1000) (insert "1,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,\n")) (dotimes (_ 1000) (backward-delete-char (length "1,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,\n")))) Result: (39.302071 8 4.3011029999999995) and many gc trimming. Then removed the parser and ran again. Result: (33.589416 8 4.405495999999999) No parsing is done in either run (because parsing is lazy, and I didn’t ask for the parse tree). The only difference is that, in the first run, after-change-hook updates the tree with position change.
My conclusion is that after-change-hook is pretty insignificant, and the initial parse is a bit slow (on large files). I’m running this on a 1.4 GHz Quad-Core Intel Core i5 with 16G memory. Of course, I’m open to suggestions for a better benchmark. The amateur log of the benchmark is in benchmark.el. The json file I used in the second benchmark is benchmark.2.json. The patch is ts.2.patch. [-- Attachment #2: ts.2.patch --] [-- Type: application/octet-stream, Size: 11087 bytes --] From 180aea41cdce11b9b4bdc7da0964c14c0bf8a5f0 Mon Sep 17 00:00:00 2001 From: Yuan Fu <casouri@gmail.com> Date: Fri, 16 Jul 2021 21:11:29 -0400 Subject: [PATCH] checkpoint 2: add change-hooks --- src/insdel.c | 16 +++++ src/tree_sitter.c | 163 ++++++++++++++++++++++++++++++++++++++++++++-- src/tree_sitter.h | 10 +++ 3 files changed, 182 insertions(+), 7 deletions(-) diff --git a/src/insdel.c b/src/insdel.c index e38b091f54..3c1e13d38b 100644 --- a/src/insdel.c +++ b/src/insdel.c @@ -31,6 +31,10 @@ #include "region-cache.h" #include "pdumper.h" +#ifdef HAVE_TREE_SITTER +#include "tree_sitter.h" +#endif + static void insert_from_string_1 (Lisp_Object, ptrdiff_t, ptrdiff_t, ptrdiff_t, ptrdiff_t, bool, bool); static void insert_from_buffer_1 (struct buffer *, ptrdiff_t, ptrdiff_t, bool); @@ -2152,6 +2156,11 @@ signal_before_change (ptrdiff_t start_int, ptrdiff_t end_int, run_hook (Qfirst_change_hook); } +#ifdef HAVE_TREE_SITTER + /* FIXME: Is this the best place? */ + ts_before_change (start_int, end_int); +#endif + /* Now run the before-change-functions if any. */ if (!NILP (Vbefore_change_functions)) { @@ -2205,6 +2214,13 @@ signal_after_change (ptrdiff_t charpos, ptrdiff_t lendel, ptrdiff_t lenins) if (inhibit_modification_hooks) return; +#ifdef HAVE_TREE_SITTER + /* We disrespect combine-after-change, because if we don't record + this change, the information that we need (the end byte position + of the change) will be lost. 
*/ + ts_after_change (charpos, lendel, lenins); +#endif + /* If we are deferring calls to the after-change functions and there are no before-change functions, just record the args that we were going to use. */ diff --git a/src/tree_sitter.c b/src/tree_sitter.c index f2134c571a..7d1225161c 100644 --- a/src/tree_sitter.c +++ b/src/tree_sitter.c @@ -27,6 +27,7 @@ Copyright (C) 2021 Free Software Foundation, Inc. #include <stdlib.h> #include <unistd.h> +#include "lisp.h" #include "buffer.h" #include "coding.h" #include "tree_sitter.h" @@ -34,6 +35,98 @@ Copyright (C) 2021 Free Software Foundation, Inc. /* parser.h defines a macro ADVANCE that conflicts with alloc.c. */ #include <tree_sitter/parser.h> +/* Record the byte position of the end of the (to-be) changed text. +We have to record it now, because by the time we get to after-change +hook, the _byte_ position of the end is lost. */ +void +ts_before_change (ptrdiff_t start_int, ptrdiff_t end_int) +{ + /* Iterate through each parser in 'tree-sitter-parser-list' and + record the byte position. There could be better ways to record + it than storing the same position in every parser, but this is + the most fool-proof way, and I expect a buffer to have only one + parser most of the time anyway. */ + ptrdiff_t beg_byte = CHAR_TO_BYTE (start_int); + ptrdiff_t old_end_byte = CHAR_TO_BYTE (end_int); + Lisp_Object parser_list = Fsymbol_value (Qtree_sitter_parser_list); + while (!NILP (parser_list)) + { + Lisp_Object lisp_parser = Fcar (parser_list); + XTS_PARSER (lisp_parser)->edit.start_byte = beg_byte; + XTS_PARSER (lisp_parser)->edit.old_end_byte = old_end_byte; + parser_list = Fcdr (parser_list); + } +} + +/* Update each parser's tree after the user made an edit. This +function does not parse the buffer and only updates the tree. (So it +should be very fast.) 
*/ +void +ts_after_change (ptrdiff_t charpos, ptrdiff_t lendel, ptrdiff_t lenins) +{ + ptrdiff_t new_end_byte = CHAR_TO_BYTE (charpos + lenins); + Lisp_Object parser_list = Fsymbol_value (Qtree_sitter_parser_list); + while (!NILP (parser_list)) + { + Lisp_Object lisp_parser = Fcar (parser_list); + TSTree *tree = XTS_PARSER (lisp_parser)->tree; + XTS_PARSER (lisp_parser)->edit.new_end_byte = new_end_byte; + if (tree != NULL) + ts_tree_edit (tree, &XTS_PARSER (lisp_parser)->edit); + parser_list = Fcdr (parser_list); + } +} + +/* Parse the buffer. We don't parse until we have to. When we have +to, we call this function to parse and update the tree. */ +void +ts_ensure_parsed (Lisp_Object parser) +{ + TSParser *ts_parser = XTS_PARSER (parser)->parser; + TSTree *tree = XTS_PARSER(parser)->tree; + TSInput input = XTS_PARSER (parser)->input; + TSTree *new_tree = ts_parser_parse(ts_parser, tree, input); + XTS_PARSER (parser)->tree = new_tree; +} + +/* This is the read function provided to tree-sitter to read from a + buffer. It reads one character at a time and automatically skip + the gap. */ +const char* +ts_read_buffer (void *buffer, uint32_t byte_index, + TSPoint position, uint32_t *bytes_read) +{ + if (! BUFFER_LIVE_P ((struct buffer *) buffer)) + error ("BUFFER is not live"); + + ptrdiff_t byte_pos = byte_index + 1; + + // FIXME: Add some boundary checks? + /* I believe we can get away with only setting current-buffer + and not actually switching to it, like what we did in + 'make_gap_1'. */ + struct buffer *old_buffer = current_buffer; + current_buffer = (struct buffer *) buffer; + + /* Read one character. */ + char *beg; + int len; + if (byte_pos >= Z_BYTE) + { + beg = ""; + len = 0; + } + else + { + beg = (char *) BYTE_POS_ADDR (byte_pos); + len = next_char_len(byte_pos); + } + *bytes_read = (uint32_t) len; + current_buffer = old_buffer; + return beg; +} + +/* Wrap the parser in a Lisp_Object to be used in the Lisp machine. 
*/ Lisp_Object make_ts_parser (struct buffer *buffer, TSParser *parser, TSTree *tree) { @@ -42,10 +135,15 @@ make_ts_parser (struct buffer *buffer, TSParser *parser, TSTree *tree) lisp_parser->buffer = buffer; lisp_parser->parser = parser; lisp_parser->tree = tree; - // TODO TSInput. + TSInput input = {buffer, ts_read_buffer, TSInputEncodingUTF8}; + lisp_parser->input = input; + TSPoint dummy_point = {0, 0}; + TSInputEdit edit = {0, 0, 0, dummy_point, dummy_point, dummy_point}; + lisp_parser->edit = edit; return make_lisp_ptr (lisp_parser, Lisp_Vectorlike); } +/* Wrap the node in a Lisp_Object to be used in the Lisp machine. */ Lisp_Object make_ts_node (Lisp_Object parser, TSNode node) { @@ -57,19 +155,59 @@ make_ts_node (Lisp_Object parser, TSNode node) } -/* Tree-sitter parser. */ +DEFUN ("tree-sitter-create-parser", + Ftree_sitter_create_parser, Stree_sitter_create_parser, + 2, 2, 0, + doc: /* Create and return a parser in BUFFER for LANGUAGE. +The parser is automatically added to BUFFER's +`tree-sitter-parser-list'. LANGUAGE should be the language provided +by a tree-sitter language dynamic module. */) + (Lisp_Object buffer, Lisp_Object language) +{ + CHECK_BUFFER(buffer); + + /* LANGUAGE is a USER_PTR that contains the pointer to a TSLanguage + struct. */ + TSParser *parser = ts_parser_new (); + TSLanguage *lang = (XUSER_PTR (language)->p); + ts_parser_set_language (parser, lang); + + Lisp_Object lisp_parser + = make_ts_parser (XBUFFER(buffer), parser, NULL); + + // FIXME: Is this the correct way to set a buffer-local variable? + struct buffer *old_buffer = current_buffer; + set_buffer_internal (XBUFFER (buffer)); + + Fset (Qtree_sitter_parser_list, + Fcons (lisp_parser, Fsymbol_value (Qtree_sitter_parser_list))); + + set_buffer_internal (old_buffer); + return lisp_parser; +} + +DEFUN ("tree-sitter-parser-root-node", + Ftree_sitter_parser_root_node, Stree_sitter_parser_root_node, + 1, 1, 0, + doc: /* Return the root node of PARSER. 
*/) + (Lisp_Object parser) +{ + ts_ensure_parsed(parser); + TSNode root_node = ts_tree_root_node (XTS_PARSER (parser)->tree); + return make_ts_node (parser, root_node); +} DEFUN ("tree-sitter-parse", Ftree_sitter_parse, Stree_sitter_parse, 2, 2, 0, - doc: /* Parse STRING and return a parser object. + doc: /* Parse STRING and return the root node. LANGUAGE should be the language provided by a tree-sitter language dynamic module. */) (Lisp_Object string, Lisp_Object language) { CHECK_STRING (string); - /* LANGUAGE is a USER_PTR that contains the pointer to a - TSLanguage struct. */ + /* LANGUAGE is a USER_PTR that contains the pointer to a TSLanguage + struct. */ TSParser *parser = ts_parser_new (); TSLanguage *lang = (XUSER_PTR (language)->p); ts_parser_set_language (parser, lang); @@ -104,7 +242,7 @@ DEFUN ("tree-sitter-node-string", DEFUN ("tree-sitter-node-parent", Ftree_sitter_node_parent, Stree_sitter_node_parent, 1, 1, 0, doc: /* Return the immediate parent of NODE. -Return nil if couldn't find any. */) +Return nil if we couldn't find any. */) (Lisp_Object node) { TSNode ts_node = XTS_NODE (node)->node; @@ -119,7 +257,7 @@ DEFUN ("tree-sitter-node-parent", DEFUN ("tree-sitter-node-child", Ftree_sitter_node_child, Stree_sitter_node_child, 2, 2, 0, doc: /* Return the Nth child of NODE. -Return nil if couldn't find any. */) +Return nil if we couldn't find any. */) (Lisp_Object node, Lisp_Object n) { CHECK_INTEGER (n); @@ -138,6 +276,17 @@ DEFUN ("tree-sitter-node-child", void syms_of_tree_sitter (void) { + DEFSYM (Qtree_sitter_parser_list, "tree-sitter-parser-list"); + DEFVAR_LISP ("ts-parser-list", Vtree_sitter_parser_list, + doc: /* A list of tree-sitter parsers. +// TODO: more doc. +If you removed a parser from this list, do not put it back in. 
*/); + Vtree_sitter_parser_list = Qnil; + Fmake_variable_buffer_local (Qtree_sitter_parser_list); + + + defsubr (&Stree_sitter_create_parser); + defsubr (&Stree_sitter_parser_root_node); defsubr (&Stree_sitter_parse); defsubr (&Stree_sitter_node_string); defsubr (&Stree_sitter_node_parent); diff --git a/src/tree_sitter.h b/src/tree_sitter.h index 3c9e03475f..0606f336cc 100644 --- a/src/tree_sitter.h +++ b/src/tree_sitter.h @@ -28,6 +28,8 @@ #define EMACS_TREE_SITTER_H INLINE_HEADER_BEGIN +/* A wrapper for a tree-sitter parser, but also contains a parse tree + and other goodies for convenience. */ struct Lisp_TS_Parser { union vectorlike_header header; @@ -35,8 +37,10 @@ #define EMACS_TREE_SITTER_H TSParser *parser; TSTree *tree; TSInput input; + TSInputEdit edit; }; +/* A wrapper around a tree-sitter node. */ struct Lisp_TS_Node { union vectorlike_header header; @@ -74,6 +78,12 @@ XTS_NODE (Lisp_Object a) return XUNTAG (a, Lisp_Vectorlike, struct Lisp_TS_Node); } +void +ts_before_change (ptrdiff_t charpos, ptrdiff_t lendel); + +void +ts_after_change (ptrdiff_t charpos, ptrdiff_t lendel, ptrdiff_t lenins); + Lisp_Object make_ts_parser (struct buffer *buffer, TSParser *parser, TSTree *tree); -- 2.24.3 (Apple Git-128) [-- Attachment #3: benchmark.2.json --] [-- Type: application/json, Size: 1689073 bytes --] [-- Attachment #4: benchmark.el --] [-- Type: application/octet-stream, Size: 1134 bytes --] checkpoint 2 - benchmark.1.json (22M) - open literally (benchmark-run 10 (tree-sitter-parser-root-node (tree-sitter-create-parser (current-buffer) (tree-sitter-json)))) RESULT: stuck, used all my memory (14G and still growing) (benchmark-run 1 (tree-sitter-parser-root-node (tree-sitter-create-parser (current-buffer) (tree-sitter-json)))) 17s, 3G memory. 
\f checkpoint 2 - benchmark.2.json (1.6M) - fundamental-mode (benchmark-run 1 (tree-sitter-parser-root-node (tree-sitter-create-parser (current-buffer) (tree-sitter-json)))) (1.039289 0 0.0) (benchmark-run 1000 (dotimes (_ 1000) (insert "1,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,\n")) (dotimes (_ 1000) (backward-delete-char (length "1,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,\n")))) With parser: (39.302071 8 4.3011029999999995) Without parser: (33.589416 8 4.405495999999999) Note: Warning (undo): Buffer ‘benchmark.2.json’ undo info was 27188988 bytes long. The undo info was discarded because it exceeded `undo-outer-limit'. [-- Attachment #5: Type: text/plain, Size: 8 bytes --] Yuan ^ permalink raw reply related [flat|nested] 370+ messages in thread
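Editorial note on the patch above: the new `edit` field and the `ts_before_change` / `ts_after_change` declarations suggest that each buffer change is turned into an edit record before the next parse. What follows is a minimal sketch of that bookkeeping, not the code from the patch. `struct edit_range` and `edit_from_change` are hypothetical names, the struct only mirrors the three byte offsets an edit needs, and the sketch assumes one byte per character and 1-based buffer positions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the byte offsets of an edit record.
   This is NOT tree-sitter's TSInputEdit, only an illustration.  */
struct edit_range
{
  ptrdiff_t start_byte;   /* where the change begins (0-based) */
  ptrdiff_t old_end_byte; /* end of the replaced text before the change */
  ptrdiff_t new_end_byte; /* end of the inserted text after the change */
};

/* Build an edit record from the arguments an after-change hook
   receives.  Assumption: CHARPOS is 1-based and every character
   occupies exactly one byte.  */
static struct edit_range
edit_from_change (ptrdiff_t charpos, ptrdiff_t lendel, ptrdiff_t lenins)
{
  assert (charpos >= 1 && lendel >= 0 && lenins >= 0);
  struct edit_range e;
  e.start_byte = charpos - 1;
  e.old_end_byte = e.start_byte + lendel;
  e.new_end_byte = e.start_byte + lenins;
  return e;
}
```

With multibyte text the character position would first have to be converted to a byte position, which this sketch deliberately leaves out.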
* Re: How to add pseudo vector types 2021-07-17 2:05 ` Yuan Fu @ 2021-07-17 2:23 ` Clément Pit-Claudel 2021-07-17 3:12 ` Yuan Fu ` (2 more replies) 2021-07-17 6:56 ` Eli Zaretskii 1 sibling, 3 replies; 370+ messages in thread
From: Clément Pit-Claudel @ 2021-07-17 2:23 UTC (permalink / raw)
To: emacs-devel

On 7/16/21 10:05 PM, Yuan Fu wrote:
> My conclusion is that after-change-hook is pretty insignificant, and the initial parse is a bit slow (on large files).

I have no idea if it makes sense, but: does the initial parse need to be synchronous, or could you instead run the parsing in one thread, and the rest of Emacs in another? (I'm talking about concurrent execution, not cooperative threading).

In most cases there should be very limited contention, if at all: in large buffers most of Emacs' activity will be focused on the (relatively few) characters around the gap, and most of the parser's activity will be reading from the buffer at other positions. You do need to be careful to not read the garbage data from the gap, but otherwise seeing stale or even inconsistent data from the parser thread shouldn't be an issue, since tree-sitter is supposed to be robust to bad parses.

In fact, depending on how robust tree-sitter is, you might even be able to do the concurrency-control optimistically (parse everything up to close to the gap, check that the gap hasn't moved into the region that you read, and then resume reading or rollback).

Alternatively, maybe you could even do a full parse with minimal concurrency control: you'd make sure that the Emacs thread records not just changes to the buffer text but also movements of the gap, and then you could use that list of changes for the next parse?

Anyway, thanks for working on this!

^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-17 2:23 ` Clément Pit-Claudel @ 2021-07-17 3:12 ` Yuan Fu 2021-07-17 7:18 ` Eli Zaretskii 2021-07-17 7:16 ` Eli Zaretskii 2021-07-17 17:30 ` Stefan Monnier 2 siblings, 1 reply; 370+ messages in thread
From: Yuan Fu @ 2021-07-17 3:12 UTC (permalink / raw)
To: Clément Pit-Claudel; +Cc: emacs-devel

> On Jul 16, 2021, at 10:23 PM, Clément Pit-Claudel <cpitclaudel@gmail.com> wrote:
>
> On 7/16/21 10:05 PM, Yuan Fu wrote:
>> My conclusion is that after-change-hook is pretty insignificant, and the initial parse is a bit slow (on large files).
>
> I have no idea if it makes sense, but: does the initial parse need to be synchronous, or could you instead run the parsing in one thread, and the rest of Emacs in another? (I'm talking about concurrent execution, not cooperative threading).
>
> In most cases there should be very limited contention, if at all: in large buffers most of Emacs' activity will be focused on the (relatively few) characters around the gap, and most of the parser's activity will be reading from the buffer at other positions. You do need to be careful to not read the garbage data from the gap, but otherwise seeing stale or even inconsistent data from the parser thread shouldn't be an issue, since tree-sitter is supposed to be robust to bad parses.
>
> In fact, depending on how robust tree-sitter is, you might even be able to do the concurrency-control optimistically (parse everything up to close to the gap, check that the gap hasn't moved into the region that you read, and then resume reading or rollback).
>
> Alternatively, maybe you could even do a full parse with minimal concurrency control: you'd make sure that the Emacs thread records not just changes to the buffer text but also movements of the gap, and then you could use that list of changes for the next parse?

Another way I thought about is to only “expose” the portion of buffer from BOB to some point to tree-sitter.
And when a user asks for a parse tree, he also specifies up to which point of the buffer he needs the parse tree. For example, for fontification, jit-lock only needs the tree up to the end of the visible window. And for structure editing, asking for the portion up to window-end + a few thousand characters might be enough. However, this heuristic could have problems in practice. (Maybe a giant comment section of thousands of characters follows, and instead of jumping to the end of it, we wrongly jump to the middle of that comment section, because tree-sitter only “sees” to that point.) So I don’t know if it’s a good idea.

>
> Anyway, thanks for working on this!
>

I figure that this is low-tech enough that an amateur like me could possibly do it ;-)

Yuan

^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-17 3:12 ` Yuan Fu @ 2021-07-17 7:18 ` Eli Zaretskii 0 siblings, 0 replies; 370+ messages in thread
From: Eli Zaretskii @ 2021-07-17 7:18 UTC (permalink / raw)
To: Yuan Fu; +Cc: cpitclaudel, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Fri, 16 Jul 2021 23:12:00 -0400
> Cc: emacs-devel@gnu.org
>
> Another way I thought about is to only “expose” the portion of buffer from BOB to some point to tree-sitter. And when a user asks for a parse tree, he also specifies up to which point of the buffer he needs the parse tree. For example, for fontification, jit-lock only needs the tree up to the end of the visible window. And for structure editing, asking for the portion up to window-end + a few thousand characters might be enough.

Yes, I think we should only ask TS to parse what we need, not more.

> However, this heuristic could have problems in practice. (Maybe a giant comment section of thousands of characters follows, and instead of jumping to the end of it, we wrongly jump to the middle of that comment section, because tree-sitter only “sees” to that point.) So I don’t know if it’s a good idea.

It's definitely a good idea that should be pursued. Even if in some specific situation you'd need to pass to TS a large part of buffer text, it will help in the other cases.

^ permalink raw reply [flat|nested] 370+ messages in thread
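Editorial note: the approach discussed above (hand the parser only the text from BOB up to some point past the visible window) could be sketched roughly as follows. This is an illustration under stated assumptions, not code from the patch: `parse_limit` and `read_limited` are hypothetical names, offsets are 0-based bytes, and the text is assumed to be one contiguous array with no gap.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: how far the parser is allowed to see.
   The limit is the window end plus some slack, clamped to the end
   of the buffer.  */
static ptrdiff_t
parse_limit (ptrdiff_t window_end, ptrdiff_t slack, ptrdiff_t buffer_end)
{
  assert (window_end >= 0 && slack >= 0 && buffer_end >= window_end);
  ptrdiff_t limit = window_end + slack;
  return limit < buffer_end ? limit : buffer_end;
}

/* Hypothetical read callback: return a pointer to the text at OFFSET
   and store in *LEN how many bytes may be read from there.  A length
   of zero tells the caller that the exposed text has ended, even if
   the real buffer continues past LIMIT.  */
static const char *
read_limited (const char *text, ptrdiff_t limit, ptrdiff_t offset,
              ptrdiff_t *len)
{
  assert (offset >= 0 && limit >= 0);
  if (offset >= limit)
    {
      *len = 0;
      return "";
    }
  *len = limit - offset;
  return text + offset;
}
```

The concern raised in the thread still applies to such a sketch: a construct that starts before the limit and ends after it (a long comment, say) is seen only in part.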
* Re: How to add pseudo vector types 2021-07-17 2:23 ` Clément Pit-Claudel 2021-07-17 3:12 ` Yuan Fu @ 2021-07-17 7:16 ` Eli Zaretskii 2021-07-20 20:36 ` Clément Pit-Claudel 2021-07-17 17:30 ` Stefan Monnier 2 siblings, 1 reply; 370+ messages in thread
From: Eli Zaretskii @ 2021-07-17 7:16 UTC (permalink / raw)
To: Clément Pit-Claudel; +Cc: emacs-devel

> From: Clément Pit-Claudel <cpitclaudel@gmail.com>
> Date: Fri, 16 Jul 2021 22:23:26 -0400
>
> On 7/16/21 10:05 PM, Yuan Fu wrote:
> > My conclusion is that after-change-hook is pretty insignificant, and the initial parse is a bit slow (on large files).
>
> I have no idea if it makes sense, but: does the initial parse need to be synchronous, or could you instead run the parsing in one thread, and the rest of Emacs in another? (I'm talking about concurrent execution, not cooperative threading).

You cannot have a thread freely accessing buffer text when the Lisp machine is allowed to run concurrently with this, because the Lisp machine can change the buffer text.

> In most cases there should be very limited contention, if at all: in large buffers most of Emacs' activity will be focused on the (relatively few) characters around the gap, and most of the parser's activity will be reading from the buffer at other positions.

When Emacs moves or enlarges/shrinks the gap, that affects the entire buffer text after the gap, regardless of where the gap is. So it will affect the TS reader if it reads stuff after the gap.

> You do need to be careful to not read the garbage data from the gap, but otherwise seeing stale or even inconsistent data from the parser thread shouldn't be an issue, since tree-sitter is supposed to be robust to bad parses.

What would be the purpose of calling the parser if we know in advance it will fail when it gets to the "garbage" caused by async access to the buffer text?
And besides, current Emacs primitives that access buffer text don't necessarily do that atomically, since the assumption built into their design is that no one should access that text at the same time. So you could have windows where the buffer text is in inconsistent state, like if the gap was moved, but the variables which tell where the gap is were not yet updated, or windows where a multibyte character was not yet completely written or deleted to/from the buffer, resulting in invalid multibyte sequences and inconsistent values of EOB. So I don't see how this could be done without some inter-locking. And what do you want the code which requested parsing do while the parse thread runs? The requesting code is in the main thread, so if it just waits, you don't gain anything. > In fact, depending on how robust tree-sitter is, you might even be able to do the concurrency-control optimistically (parse everything up to close to the gap, check that the gap hasn't moved into the region that you read, and then resume reading or rollback). I don't understand what you suggest here. For starters, the gap could move (assuming you are still talking about a separate thread that does the parsing), and what do we do then? > Alternatively, maybe you could even do a full parse with minimal concurrency control: you'd make sure that the Emacs thread records not just changes to the buffer text but also movements of the gap, and then you could use that list of changes for the next parse? I don't understand what could recording the gap solve. The stuff in the gap is generally garbage, and can easily include invalid multibyte sequences. I don't think it's a good idea to pass that to TS. Also, recording the gap changes in the main thread and accessing that information from a concurrent thread again opens a window for races, and requires synchronization. Bottom line, I think what you are suggesting is premature optimization: we don't yet know that we will need this. 
If the TS performance information is reliable, it should be fast enough for our purposes; we just need to come up with an optimal way of calling it so that we don't impose unnecessary delays. ^ permalink raw reply [flat|nested] 370+ messages in thread
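Editorial note: for readers unfamiliar with the gap, the following sketch shows one way a reader running in the main thread could hand out text in chunks without ever touching the gap. It is an illustration only. `struct gap_text` and `gap_text_chunk` are hypothetical names and do not correspond to Emacs's actual buffer structures, and the sketch is only valid when nothing modifies the text while it is being read, which is exactly the constraint described in the message above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified gap buffer: the storage holds the text
   before the gap, then GAP_SIZE bytes of garbage, then the rest.  */
struct gap_text
{
  const char *beg;     /* start of the storage */
  ptrdiff_t gap_start; /* offset of the gap within the storage */
  ptrdiff_t gap_size;  /* number of bytes occupied by the gap */
  ptrdiff_t text_len;  /* bytes of real text, gap excluded */
};

/* Return a pointer to the text at logical offset POS and store in
   *LEN how many contiguous bytes can be read from there.  A chunk
   never crosses the gap: a caller that wants more text calls again
   with POS advanced by *LEN.  *LEN is zero at end of text.  */
static const char *
gap_text_chunk (const struct gap_text *t, ptrdiff_t pos, ptrdiff_t *len)
{
  assert (pos >= 0);
  if (pos >= t->text_len)
    {
      *len = 0;
      return "";
    }
  if (pos < t->gap_start)
    {
      /* Before the gap: readable up to where the gap begins.  */
      *len = t->gap_start - pos;
      return t->beg + pos;
    }
  /* After the gap: skip over the gap bytes in the storage.  */
  *len = t->text_len - pos;
  return t->beg + t->gap_size + pos;
}
```

Because the chunk boundaries depend on `gap_start` and `gap_size`, any concurrent gap movement would invalidate a pointer already handed out, so a threaded reader would need the interlocking the thread talks about.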
* Re: How to add pseudo vector types 2021-07-17 7:16 ` Eli Zaretskii @ 2021-07-20 20:36 ` Clément Pit-Claudel 2021-07-21 11:26 ` Eli Zaretskii 2021-07-21 16:29 ` Stephen Leake 0 siblings, 2 replies; 370+ messages in thread
From: Clément Pit-Claudel @ 2021-07-20 20:36 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: emacs-devel

Thanks for the detailed reply.

On 7/17/21 3:16 AM, Eli Zaretskii wrote:
> When Emacs moves or enlarges/shrinks the gap, that affects the entire
> buffer text after the gap, regardless of where the gap is. So it will
> affect the TS reader if it reads stuff after the gap.

Doesn't enlarging the gap require allocating a new buffer and copying data to it? If so it wouldn't affect the TS reader. Moving is indeed trickier, that's what I referred to as "limited contention".

>> You do need to be careful to not read the garbage data from the gap, but otherwise seeing stale or even inconsistent data from the parser thread shouldn't be an issue, since tree-sitter is supposed to be robust to bad parses.
>
> What would be the purpose of calling the parser if we know in advance
> it will fail when it gets to the "garbage" caused by async access to
> the buffer text?

It won't fail, will it? I thought this was the point of TS, that it would reuse the initial parse on the "good" parts in subsequent parses.

> So I don't see how this could be done without some inter-locking.

Yes, there probably needs to be some care around the gap area. But that's what I was referring to re. "optimistic concurrency".

> And what do you want the code which requested parsing do while the
> parse thread runs? The requesting code is in the main thread, so if
> it just waits, you don't gain anything.

You'd have the parser running continuously in the background, every time there is a change. When a piece of code requests a parse it blocks and waits, but presumably for not too long because a very recent previous parse means that the blocking parse is fast.
>> In fact, depending on how robust tree-sitter is, you might even be able to do the concurrency-control optimistically (parse everything up to close to the gap, check that the gap hasn't moved into the region that you read, and then resume reading or rollback). > > I don't understand what you suggest here. For starters, the gap could > move (assuming you are still talking about a separate thread that does > the parsing), and what do we do then? Nothing, we start the next parse when this one completes. > I don't understand what could recording the gap solve. The stuff in > the gap is generally garbage, and can easily include invalid multibyte > sequences. I don't think it's a good idea to pass that to TS. Also, > recording the gap changes in the main thread and accessing that > information from a concurrent thread again opens a window for races, > and requires synchronization. This list of gap changes wouldn't be accessed concurrently: you would (message-)pass a copy of it to the parser thread every time it starts a new parse. > Bottom line, I think what you are suggesting is premature > optimization: we don't yet know that we will need this. I thought we knew that a full parse of some files could take over a second; but yes, it will be nice if we can find a synchronous way to avoid having to do a full parse. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-20 20:36 ` Clément Pit-Claudel @ 2021-07-21 11:26 ` Eli Zaretskii 2021-07-21 13:38 ` Clément Pit-Claudel 2021-07-21 16:29 ` Stephen Leake 1 sibling, 1 reply; 370+ messages in thread
From: Eli Zaretskii @ 2021-07-21 11:26 UTC (permalink / raw)
To: Clément Pit-Claudel; +Cc: emacs-devel

> Cc: emacs-devel@gnu.org
> From: Clément Pit-Claudel <cpitclaudel@gmail.com>
> Date: Tue, 20 Jul 2021 16:36:42 -0400
>
> Thanks for the detailed reply.
>
> On 7/17/21 3:16 AM, Eli Zaretskii wrote:
> > When Emacs moves or enlarges/shrinks the gap, that affects the entire
> > buffer text after the gap, regardless of where the gap is. So it will
> > affect the TS reader if it reads stuff after the gap.
>
> Doesn't enlarging the gap require allocating a new buffer and copying data to it?

Not necessarily. First, the gap could be enlarged for reasons other than growing buffer text as a whole. And even if we must grow buffer text, a good memory-allocation system will often try to resize the existing memory block before it allocates another.

> If so it wouldn't affect the TS reader.

Not true, in general. When a new block is allocated by the OS/libc, the old one is generally invalid and cannot be accessed. In many cases, the old block could be unmapped from the program's address space, in which case accessing it will segfault.

> Moving is indeed trickier, that's what I referred to as "limited contention".

We move the gap quite a lot.

> >> You do need to be careful to not read the garbage data from the gap, but otherwise seeing stale or even inconsistent data from the parser thread shouldn't be an issue, since tree-sitter is supposed to be robust to bad parses.
> >
> > What would be the purpose of calling the parser if we know in advance
> > it will fail when it gets to the "garbage" caused by async access to
> > the buffer text?
>
> It won't fail, will it?
"Fail" in the sense that it will be able to process only a small portion of buffer text before it gets to garbage. > > And what do you want the code which requested parsing do while the > > parse thread runs? The requesting code is in the main thread, so if > > it just waits, you don't gain anything. > > You'd have the parser running continuously in the background, every time there is a change. When a piece of code requests a parse it blocks and waits, but presumably for not too long because a very recent previous parse means that the blocking parse is fast. Well, you cannot safely/usefully parse the buffer "continuously in the background", for the reasons explained above, because Lisp programs change buffer text quite a lot. > > I don't understand what could recording the gap solve. The stuff in > > the gap is generally garbage, and can easily include invalid multibyte > > sequences. I don't think it's a good idea to pass that to TS. Also, > > recording the gap changes in the main thread and accessing that > > information from a concurrent thread again opens a window for races, > > and requires synchronization. > > This list of gap changes wouldn't be accessed concurrently: you would (message-)pass a copy of it to the parser thread every time it starts a new parse. I still don't see the point. Can you describe in more detail what would you suggest doing with the list of gap changes? Just take a specific example of a small set of gap changes and tell how to use that. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-21 11:26 ` Eli Zaretskii @ 2021-07-21 13:38 ` Clément Pit-Claudel 2021-07-21 13:51 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Clément Pit-Claudel @ 2021-07-21 13:38 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel On 7/21/21 7:26 AM, Eli Zaretskii wrote: > I still don't see the point. Can you describe in more detail what > would you suggest doing with the list of gap changes? Just take a > specific example of a small set of gap changes and tell how to use > that. I can try, but the idea was half-baked from the start, so I'm not sure how much value it will bring. All I was saying is that depending on how robust TS is, feeding it: <valuable text><small bit of the gap because the gap moved while TS was scanning><more valuable data> and then, knowing that the gap had moved, re-feeding it just the area that corresponds to the places around the boundaries of the gap might yield a speedup. So if the buffer is XYYGGGZ, where G is the gap, and becomes XGGIYYZ while we're scanning because of cursor motion + an insertion, then TS might see XYGIYYZ, due to concurrent mutations; but if we recorded that the gap moved and insertions happened at -#####---, then we can re-feed GGIYY to TS (omitting the Gs, of course), and hopefully it can reuse the parse of X and Z. If X and Z are long enough, that can be valuable. Alternatively, keeping the list of changes allows us to maintain a copy of the buffer that TS uses for scanning, with updates delayed until TS is done scanning. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-21 13:38 ` Clément Pit-Claudel @ 2021-07-21 13:51 ` Eli Zaretskii 2021-07-22 4:59 ` Clément Pit-Claudel 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-07-21 13:51 UTC (permalink / raw) To: Clément Pit-Claudel; +Cc: emacs-devel > From: Clément Pit-Claudel <cpitclaudel@gmail.com> > Date: Wed, 21 Jul 2021 09:38:31 -0400 > Cc: emacs-devel@gnu.org > > <valuable text><small bit of the gap because the gap moved while TS was scanning><more valuable data> > > and then, knowing that the gap had moved, re-feeding it just the area that corresponds to the places around the boundaries of the gap might yield a speedup. You are assuming that TS will be able to process both <valuable text> and <more valuable data>, even though it eats the garbage in the gap? That isn't guaranteed, due to possibly invalid byte sequences in the gap. Without synchronization, you also risk reading invalid byte sequences even outside the gap, because while you read part of a byte sequence, some editing operation modifies the buffer at that very place. > Alternatively, keeping the list of changes allows us to maintain a copy of the buffer that TS uses for scanning, with updates delayed until TS is done scanning. Having a copy for each buffer that needs parsing doesn't scale. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-21 13:51 ` Eli Zaretskii @ 2021-07-22 4:59 ` Clément Pit-Claudel 2021-07-22 6:38 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread
From: Clément Pit-Claudel @ 2021-07-22 4:59 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: emacs-devel

On 7/21/21 9:51 AM, Eli Zaretskii wrote:
>> From: Clément Pit-Claudel <cpitclaudel@gmail.com>
>> Date: Wed, 21 Jul 2021 09:38:31 -0400
>> Cc: emacs-devel@gnu.org
>>
>> <valuable text><small bit of the gap because the gap moved while TS was scanning><more valuable data>
>>
>> and then, knowing that the gap had moved, re-feeding it just the area that corresponds to the places around the boundaries of the gap might yield a speedup.
>
> You are assuming that TS will be able to process both <valuable text>
> and <more valuable data>, even though it eats the garbage in the gap?
> That isn't guaranteed, due to possibly invalid byte sequences in the
> gap.

Yes, that's fair.

>> Alternatively, keeping the list of changes allows us to maintain a copy of the buffer that TS uses for scanning, with updates delayed until TS is done scanning.
>
> Having a copy for each buffer that needs parsing doesn't scale.

Because of time, or because of memory? I thought we assumed memory was a non-issue, because tree-sitter's data structures seem to require *a lot* more space than the text of the underlying buffer (in 2018 the main dev said "syntax trees still use over 10x as much memory as the size of the source file."). Copying time can be an issue, for sure, but memcpy() is fast these days ^^

^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-22 4:59 ` Clément Pit-Claudel @ 2021-07-22 6:38 ` Eli Zaretskii 0 siblings, 0 replies; 370+ messages in thread
From: Eli Zaretskii @ 2021-07-22 6:38 UTC (permalink / raw)
To: Clément Pit-Claudel; +Cc: emacs-devel

> Cc: emacs-devel@gnu.org
> From: Clément Pit-Claudel <cpitclaudel@gmail.com>
> Date: Thu, 22 Jul 2021 00:59:31 -0400
>
> >> Alternatively, keeping the list of changes allows us to maintain a copy of the buffer that TS uses for scanning, with updates delayed until TS is done scanning.
> >
> > Having a copy for each buffer that needs parsing doesn't scale.
>
> Because of time, or because of memory?

Memory, mostly.

> I thought we assumed memory was a non-issue, because tree-sitter's data structures seem to require *a lot* more space than the text of the underlying buffer (in 2018 the main dev said "syntax trees still use over 10x as much memory as the size of the source file.").

You are talking about _adding_ to that another copy of the buffer's text, which could be many megabytes. And your proposal means we will have such copies for many buffers.

As for the TS memory requirements, if they really need 1GB for a 100MB file (I doubt that), then TS is probably not a good candidate for Emacs.

> Copying time can be an issue, for sure, but memcpy() is fast these days ^^

You forget the time needed to allocate the memory for the copy, which could be orders of magnitude slower for large buffers, especially if there's a lot of memory pressure on the OS.

^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-20 20:36 ` Clément Pit-Claudel 2021-07-21 11:26 ` Eli Zaretskii @ 2021-07-21 16:29 ` Stephen Leake 2021-07-21 16:54 ` Clément Pit-Claudel 1 sibling, 1 reply; 370+ messages in thread From: Stephen Leake @ 2021-07-21 16:29 UTC (permalink / raw) To: Clément Pit-Claudel; +Cc: Eli Zaretskii, emacs-devel Clément Pit-Claudel <cpitclaudel@gmail.com> writes: > On 7/17/21 3:16 AM, Eli Zaretskii wrote: >>> You do need to be careful to not read the garbage data from the >>> gap, but otherwise seeing stale or even inconsistent data from the >>> parser thread shouldn't be an issue, since tree-sitter is supposed >>> to be robust to bad parses. >> >> What would be the purpose of calling the parser if we know in advance >> it will fail when it gets to the "garbage" caused by async access to >> the buffer text? > > It won't fail, will it? I thought this was the point of TS, that it > would reuse the initial parse on the "good" parts in subsequent > parses. There are limits to the error recovery, and throwing garbage text at it is likely to encounter those limits. wisi is even more robust, but I still get "error recover fail" daily. >> So I don't see how this could be done without some inter-locking. > > Yes, there probably need to be some care around the gap area. But > that's what I was referring to re. "optimistic concurrency". > >> And what do you want the code which requested parsing do while the >> parse thread runs? The requesting code is in the main thread, so if >> it just waits, you don't gain anything. > > You'd have the parser running continuously in the background, every > time there is a change. > When a piece of code requests a parse it blocks and waits, but > presumably for not too long because a very recent previous parse means > that the blocking parse is fast. If the parser is truly fast enough to keep up with typing, this does make sense. 
Good error correction is slower than not-so-good error correction, so there might be a trade-off here.

On the other hand, in the typical case of the user typing characters, font-lock is triggered on every character, so the parser is effectively synchronous, and the inter-thread communication is wasted time.

We need some metrics on a real implementation to decide this part of the design.

>>> In fact, depending on how robust tree-sitter is, you might even be
>>> able to do the concurrency-control optimistically (parse everything
>>> up to close to the gap, check that the gap hasn't moved into the
>>> region that you read, and then resume reading or rollback).
>>
>> I don't understand what you suggest here. For starters, the gap could
>> move (assuming you are still talking about a separate thread that does
>> the parsing), and what do we do then?
>
> Nothing, we start the next parse when this one completes.

By "nothing", I think you mean "abort the parse".

>> Bottom line, I think what you are suggesting is premature
>> optimization: we don't yet know that we will need this.
>
> I thought we knew that a full parse of some files could take over a
> second;

Yes, for both tree-sitter and wisi. wisi can take even longer if lots of error correction is required (I have a time-out set at 5 seconds). But that happens when the file is first opened; I doubt any user would start typing that fast. I know I typically take a while to just look at the text, and then navigate to the point of interest.

> but yes, it will be nice if we can find a synchronous way to avoid
> having to do a full parse.

Hmm. "looking at the text" is better done after it is fontified, so doing a faster but possibly worse parse and fontification on just the initial visible region might be a good idea.

While the partial parse is running, we could also spawn a parser thread to run the full parse. And if the user scrolls before the full parse is done, do a second partial parse on the new visible region.
I'll put that on my list of things to try in wisi. -- -- Stephe ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-21 16:29 ` Stephen Leake @ 2021-07-21 16:54 ` Clément Pit-Claudel 2021-07-21 19:43 ` Eli Zaretskii 2021-07-21 21:54 ` Stephen Leake 0 siblings, 2 replies; 370+ messages in thread
From: Clément Pit-Claudel @ 2021-07-21 16:54 UTC (permalink / raw)
To: emacs-devel

On 7/21/21 12:29 PM, Stephen Leake wrote:
> Yes, for both tree-sitter and wisi. wisi can take even longer if lots of
> error correction is required (I have a time-out set at 5 seconds). But
> that happens when the file is first opened; I doubt any user would start
> typing that fast. I know I typically take a while to just look at the
> text, and then navigate to the point of interest.

I'm not sure. We've had significant complaints in Flycheck for freezing Emacs for <1s: we have a synchronous sanity check to determine whether a checker can execute in a buffer (it runs a single time, and it should be async but I haven't gotten around to rewriting it). The problem is that some programs, including eslint, can take as much as 1s, and in some bad cases 2-3 seconds, to parse their own config and decide if they can even run. Users have complained about this delay.

It might be better if they were able to scroll around, though. Is that what happens with WISI? But if we have a fully synchronous TS, then that won't be possible either: it will be a complete Emacs freeze, no?

^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-21 16:54 ` Clément Pit-Claudel @ 2021-07-21 19:43 ` Eli Zaretskii 2021-07-24 2:57 ` Stephen Leake 2021-07-24 3:55 ` Clément Pit-Claudel 2021-07-21 21:54 ` Stephen Leake 1 sibling, 2 replies; 370+ messages in thread
From: Eli Zaretskii @ 2021-07-21 19:43 UTC (permalink / raw)
To: Clément Pit-Claudel; +Cc: emacs-devel

> From: Clément Pit-Claudel <cpitclaudel@gmail.com>
> Date: Wed, 21 Jul 2021 12:54:16 -0400
>
> On 7/21/21 12:29 PM, Stephen Leake wrote:
> > Yes, for both tree-sitter and wisi. wisi can take even longer if lots of
> > error correction is required (I have a time-out set at 5 seconds). But
> > that happens when the file is first opened; I doubt any user would start
> > typing that fast. I know I typically take a while to just look at the
> > text, and then navigate to the point of interest.
>
> I'm not sure. We've had significant complaints in Flycheck for freezing Emacs for <1s

How much "less"? Close to 1 sec is indeed annoying, but 20 msec or so should be bearable.

You seem to assume up front that TS (re)-parsing will take 1 sec, but AFAIK there's no reason to assume such bad performance.

^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-21 19:43 ` Eli Zaretskii @ 2021-07-24 2:57 ` Stephen Leake 2021-07-24 3:39 ` Óscar Fuentes 2021-07-24 7:06 ` Eli Zaretskii 2021-07-24 3:55 ` Clément Pit-Claudel 1 sibling, 2 replies; 370+ messages in thread From: Stephen Leake @ 2021-07-24 2:57 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Clément Pit-Claudel, emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> From: Clément Pit-Claudel <cpitclaudel@gmail.com> >> Date: Wed, 21 Jul 2021 12:54:16 -0400 >> >> On 7/21/21 12:29 PM, Stephen Leake wrote: >> > Yes, for both tree-sitter and wisi. wisi can take even longer if lots of >> > error correction is required (I have a time-out set at 5 seconds). But >> > that happens when the file is first opened; I doubt any user would start >> > typing that fast. I know I typically take a while to just look at the >> > text, and then navigate to the point of interest. >> >> I'm not sure. We've had significant complaint in Flycheck for freezing Emacs for <1s > > How much "less"? Close to 1 sec is indeed annoying, but 20 msec or so > should be bearable. > > You seem to assume up front that TS (re)-parsing will take 1 sec, but > AFAIK there's no reason to assume such bad performance. This is for the initial parse, on a large file. No matter how fast the parser is, I can give you a file that takes one second to parse, and some user will have such a file (the work always expands to consume all the resources available). I just got incremental parse working well enough to measure it; in the largest Ada file I have (10,000 lines from Eurocontrol): initial parse: 1.539319 seconds re-indent two lines: 0.038999 seconds 39 milliseconds for re-indent is just slow enough to be noticeable; I still have algorithms to convert to be as incremental as possible. The initial parse includes sending the full file text to the external process over a pipe. 
Parsing that same large file with the command-line parser (no emacs involved; file is memory-mapped) takes only 0.190 seconds, so there is lots of room for optimization - moving to a module with direct access to the emacs buffer should do a lot. In a very small file: initial 0.000632 seconds re-indent 0.000942 seconds Easily fast enough to keep up with the user. I don't have a direct comparison of tree-sitter and wisi parsing the same file; I'll have to see if I can set that up. -- -- Stephe ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-24 2:57 ` Stephen Leake @ 2021-07-24 3:39 ` Óscar Fuentes 2021-07-24 7:34 ` Eli Zaretskii 2021-07-25 16:49 ` Stephen Leake 2021-07-24 7:06 ` Eli Zaretskii 1 sibling, 2 replies; 370+ messages in thread From: Óscar Fuentes @ 2021-07-24 3:39 UTC (permalink / raw) To: emacs-devel Stephen Leake <stephen_leake@stephe-leake.org> writes: > 39 milliseconds for re-indent is just slow enough to be noticeable; I still > have algorithms to convert to be as incremental as possible. [snip] > In a very small file: > > initial 0.000632 seconds > re-indent 0.000942 seconds > > Easily fast enough to keep up with the user. Doing work every time the user changes the file is not always a good thing. Nowadays the user doesn't just expect automatic indentation, he wants code formatting too, which means splitting, fusing and inserting lines, plus moving chunks of code left and right. Doing that every time a character is added or deleted can be visually confusing due to chunks of text changing positions as you type, so the systems I know are triggered by certain events (like the insertion of characters that mark the end of statements). Then they analyze the code and, if it is well formed, apply the reformatting. Something similar could be said about fontification and other tasks. In my experience, delays of 0.1 seconds are perfectly acceptable with this method. So I'll insist on not obsessing too much about performance. Implement something simple, see if it is usable. If not, invest effort on optimizations until it is good enough. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-24 3:39 ` Óscar Fuentes @ 2021-07-24 7:34 ` Eli Zaretskii 2021-07-25 16:49 ` Stephen Leake 1 sibling, 0 replies; 370+ messages in thread From: Eli Zaretskii @ 2021-07-24 7:34 UTC (permalink / raw) To: Óscar Fuentes; +Cc: emacs-devel > From: Óscar Fuentes <ofv@wanadoo.es> > Date: Sat, 24 Jul 2021 05:39:09 +0200 > > So I'll insist on not obsessing too much about performance. Implement > something simple, see if it is usable. If not, invest effort on > optimizations until it is good enough. Agreed. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-24 3:39 ` Óscar Fuentes 2021-07-24 7:34 ` Eli Zaretskii @ 2021-07-25 16:49 ` Stephen Leake 1 sibling, 0 replies; 370+ messages in thread From: Stephen Leake @ 2021-07-25 16:49 UTC (permalink / raw) To: Óscar Fuentes; +Cc: emacs-devel Óscar Fuentes <ofv@wanadoo.es> writes: > Stephen Leake <stephen_leake@stephe-leake.org> writes: > >> 39 milliseconds for re-indent is just slow enough to be noticeable; I still >> have algorithms to convert to be as incremental as possible. > > [snip] > >> In a very small file: >> >> initial 0.000632 seconds >> re-indent 0.000942 seconds >> >> Easily fast enough to keep up with the user. > > Doing work every time the user changes the file is not always a good > thing. It very much depends on the user's preferences. Note that in standard Emacs usage, font-lock runs after every character is typed. With the current ada-mode release, which uses partial parse instead of incremental parse, the parse process cannot keep up with user typing. So I run with jit-lock-defer-time set to 1.5 seconds. However, many people want the fontification to be much more responsive. With wisi incremental parse, ada-mode can now do that. Since I got incremental parse working in wisi, I've set jit-lock-defer-time to the default nil; I like it, and will not go back. I mostly tolerated the delay before because I knew how hard it would be to fix :). > Nowadays the user doesn't just expect automatic indentation, he wants > code formatting too, which means splitting, fusing and inserting > lines, plus moving chunks of code left and right. Doing that every > time a character is added or deleted can be visually confusing due to > chunks of text changing positions as you type, so the systems I know > are triggered by certain events (like the insertion of characters that > mark the end of statements). Yes; different parser-based operations are triggered by different events. 
That is true for wisi now; font-lock is triggered by the standard Emacs mechanisms (ie, after every character is typed, the window is scrolled, etc), indent is triggered by the standard Emacs mechanisms (indent-region-function, indent-line-function; ie RET and TAB), navigate (computing single-file cross-reference) is triggered by forward-sexp or some similar "wisi-goto-*" function, reformatting is triggered by "align" (in parallel with the standard Emacs align mechanism) or a direct wisi-reformat-* function (there are some in a context menu for Ada). All of these operations update the parse tree only if the buffer has changed; if not, they use the existing tree. The user can always customize things - wisi provides the framework. > Then they analyze the code and, if it is well formed, apply the > reformatting. Something similar could be said about fontification and > other tasks. Wisi does indentation even in the presence of syntax errors (ie, not "well formed"). This helps when writing code; when entering an "if" statement, you don't have to start with a complete template; you just type the code. It does sometimes cause confusing results; fixing the syntax always resolves that. > So I'll insist on not obsessing too much about performance. Implement > something simple, see if it is usable. If not, invest effort on > optimizations until it is good enough. Yes; premature optimization is the enemy of good enough. And good benchmarks/metrics should be the guide of any optimization; allowing font-lock to run after every character is typed is one such metric. Indentation in the presence of syntax errors is another; that was the primary complaint about ada-mode before I implemented error correction, and is still a common complaint; incremental parse will improve that. These two metrics were the trigger that started me implementing incremental parse. -- -- Stephe ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-24 2:57 ` Stephen Leake 2021-07-24 3:39 ` Óscar Fuentes @ 2021-07-24 7:06 ` Eli Zaretskii 2021-07-25 17:48 ` Stephen Leake 1 sibling, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-07-24 7:06 UTC (permalink / raw) To: Stephen Leake; +Cc: cpitclaudel, emacs-devel > From: Stephen Leake <stephen_leake@stephe-leake.org> > Cc: Clément Pit-Claudel <cpitclaudel@gmail.com>, > emacs-devel@gnu.org > Date: Fri, 23 Jul 2021 19:57:32 -0700 > > > How much "less"? Close to 1 sec is indeed annoying, but 20 msec or so > > should be bearable. > > > > You seem to assume up front that TS (re)-parsing will take 1 sec, but > > AFAIK there's no reason to assume such bad performance. > > This is for the initial parse, on a large file. No matter how fast the > parser is, I can give you a file that takes one second to parse, and > some user will have such a file (the work always expands to consume all > the resources available). That problem is already with us: if I visit xdisp.c in an unoptimized build of Emacs 28, I wait almost 4 sec for the first window-full to be displayed. (It's more like 0.5 sec in an optimized build of Emacs 27.2.) So the real question is how much using TS will _improve_ the situation. > I just got incremental parse working well enough to measure it; in the > largest Ada file I have (10,000 lines from Eurocontrol): > > initial parse: 1.539319 seconds > re-indent two lines: 0.038999 seconds > > 39 milliseconds for re-indent is just slow enough to be noticeable; I still > have algorithms to convert to be as incremental as possible. For comparison, how much does re-indentation of 2 lines take in Emacs without a parser? 39 msec might be noticeable, but it isn't annoying; anything below 50 msec isn't. Try "C-x TAB" in Emacs on 10-line block of text, and you get more than that. So if you consider that time a problem, it is here already as well. 
> The initial parse includes sending the full file text to the external > process over a pipe. So the above results are with wisi. We need timings with TS to see the results that really matter for this discussion. > I don't have a direct comparison of tree-sitter and wisi parsing the > same file; I'll have to see if I can set that up. Please do. Otherwise we are comparing apples with oranges. They are all fruit, but still... Thanks. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-24 7:06 ` Eli Zaretskii @ 2021-07-25 17:48 ` Stephen Leake 0 siblings, 0 replies; 370+ messages in thread From: Stephen Leake @ 2021-07-25 17:48 UTC (permalink / raw) To: Eli Zaretskii; +Cc: cpitclaudel, emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> From: Stephen Leake <stephen_leake@stephe-leake.org> >> Cc: Clément Pit-Claudel <cpitclaudel@gmail.com>, >> emacs-devel@gnu.org >> Date: Fri, 23 Jul 2021 19:57:32 -0700 >> >> > How much "less"? Close to 1 sec is indeed annoying, but 20 msec or so >> > should be bearable. >> > >> > You seem to assume up front that TS (re)-parsing will take 1 sec, but >> > AFAIK there's no reason to assume such bad performance. >> >> This is for the initial parse, on a large file. No matter how fast the >> parser is, I can give you a file that takes one second to parse, and >> some user will have such a file (the work always expands to consume all >> the resources available). > > That problem is already with us: if I visit xdisp.c in an unoptimized > build of Emacs 28, I wait almost 4 sec for the first window-full to be > displayed. (It's more like 0.5 sec in an optimized build of Emacs > 27.2.) So the real question is how much using TS will _improve_ the > situation. Yes. But here other solutions, like parsing only part of the buffer, offer much better improvement. >> I just got incremental parse working well enough to measure it; in the >> largest Ada file I have (10,000 lines from Eurocontrol): >> >> initial parse: 1.539319 seconds >> re-indent two lines: 0.038999 seconds >> >> 39 milliseconds for re-indent is just slow enough to be noticeable; I still >> have algorithms to convert to be as incremental as possible. > > For comparison, how much does re-indentation of 2 lines take in Emacs > without a parser? I don't think this is a meaningful question, or at least, I don't have an answer. For ada-mode, you'd have to go back to version 4.0, where the indentation was ad-hoc elisp. 
It was fast enough to be not noticeable. But I switched to a parser because that indentation algorithm was often incorrect, and was very brittle in the face of new features in new Ada language standard releases. Other languages don't use a parser for indentation, so there's no way to compare. Even the AdaCore editor Gnat Studio doesn't use their parser for indentation in Ada; Emacs ada-mode is the only one I know of. I guess you could say it's a trade of indentation quality vs speed. Witness the recent thread about inconsistent fontification in C; a parser would resolve that, but LSP via eglot is probably slower than the current elisp. Indentation is similar, but the quality difference is bigger, at least for Ada. > 39 msec might be noticeable, but it isn't annoying; anything below 50 > msec isn't. You are right; in that large Ada file, I don't notice the font-lock delay after typing each character. > Try "C-x TAB" in Emacs on 10-line block of text, and you get more than > that. Depends on the mode; text-mode: 0.4 microseconds. In xdisp.c, indenting it_char_has_category, 47.5 milliseconds. In benchmark.el, indenting benchmark-call; 1.2 milliseconds. The computation here is font-lock due to the text moving in the buffer; in the ada-mode benchmark above, it is computing indent. Calling indent-rigidly, then indent-region (which results in zero net buffer change, so apparently no significant font-lock), I get: xdisp.c: 17.1 ms benchmark.el: 3.6 ms -- -- Stephe ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-21 19:43 ` Eli Zaretskii 2021-07-24 2:57 ` Stephen Leake @ 2021-07-24 3:55 ` Clément Pit-Claudel 1 sibling, 0 replies; 370+ messages in thread From: Clément Pit-Claudel @ 2021-07-24 3:55 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel On 7/21/21 3:43 PM, Eli Zaretskii wrote: >> From: Clément Pit-Claudel <cpitclaudel@gmail.com> >> I'm not sure. We've had significant complaint in Flycheck for freezing Emacs for <1s > > How much "less"? Close to 1 sec is indeed annoying, but 20 msec or so > should be bearable. Indeed, for us the freeze happens only when the buffer is first opened, so 20ms is fine; the cases we had complaints about were close to 1s, maybe 0.8s (and in some cases significantly more, too). > You seem to assume up front that TS (re)-parsing will take 1 sec, but > AFAIK there's no reason to assume such bad performance. I expect/hope re-parsing will be much faster. For the initial parse, I was going from numbers that were given earlier in this thread. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-21 16:54 ` Clément Pit-Claudel 2021-07-21 19:43 ` Eli Zaretskii @ 2021-07-21 21:54 ` Stephen Leake 2021-07-22 4:40 ` Clément Pit-Claudel 1 sibling, 1 reply; 370+ messages in thread From: Stephen Leake @ 2021-07-21 21:54 UTC (permalink / raw) To: Clément Pit-Claudel; +Cc: emacs-devel Clément Pit-Claudel <cpitclaudel@gmail.com> writes: > On 7/21/21 12:29 PM, Stephen Leake wrote: >> Yes, for both tree-sitter and wisi. wisi can take even longer if lots of >> error correction is required (I have a time-out set at 5 seconds). But >> that happens when the file is first opened; I doubt any user would start >> typing that fast. I know I typically take a while to just look at the >> text, and then navigate to the point of interest. > > I'm not sure. We've had significant complaint in Flycheck for freezing > Emacs for <1s: we have a synchronous sanity check to determine whether > a checker can execute in a buffer (it runs a single time, and it > should be async but I haven't gotten around to rewriting it). The > problem is that some programs, including eslint, can take as much as 1s, > and in some bad cases 2-3 seconds, to parse their own config and > decide if they can even run. Ok. > Users have complained about this delay. It might be better if they > were able to scroll around, though; is that what happens with WISI? wisi supports partial parse; if a buffer is larger than a user-settable threshold, for font-lock it parses only the requested region of the file, expanded to reasonable start/end points. So in that mode, the initial parse of even a very large buffer is fast. However, using that for indentation is problematic, which is why I'm implementing incremental parse. I think continuing to support both will be useful. > But if we have a fully synchronous TS, then that won't be possible > either: it will be a complete Emacs freeze, no?
It should only freeze write operations on that buffer, so marking it read-only while waiting for the parse results might be best. -- -- Stephe ^ permalink raw reply [flat|nested] 370+ messages in thread
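[Editorial sketch] The partial-parse idea Stephen describes (parse the whole buffer when it is small, otherwise only the requested region widened to sensible boundaries) might look roughly like the following. This is not wisi's code: `struct region`, `region_to_parse`, and the choice of line boundaries as the "reasonable start/end points" are all assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical half-open byte range [start, end) in a text buffer.  */
struct region { size_t start; size_t end; };

/* Decide what to hand to the parser.  Buffers of at most THRESHOLD
   bytes are parsed whole; for larger buffers only the requested
   region is parsed, widened to the enclosing line boundaries
   (an assumed stand-in for "reasonable start/end points").  */
struct region
region_to_parse (const char *text, size_t len, size_t threshold,
                 struct region req)
{
  struct region r = req;

  if (len <= threshold)
    {
      r.start = 0;
      r.end = len;
      return r;
    }

  /* Clamp the request to the buffer.  */
  if (r.end > len)
    r.end = len;
  if (r.start > r.end)
    r.start = r.end;

  /* Move START back to just after the previous newline.  */
  while (r.start > 0 && text[r.start - 1] != '\n')
    r.start--;
  /* Move END forward to the next newline (or end of buffer).  */
  while (r.end < len && text[r.end] != '\n')
    r.end++;

  return r;
}
```

A real grammar-aware implementation would presumably widen to statement or declaration boundaries rather than lines, which is part of why using such a region for indentation is harder than using it for font-lock.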
* Re: How to add pseudo vector types 2021-07-21 21:54 ` Stephen Leake @ 2021-07-22 4:40 ` Clément Pit-Claudel 0 siblings, 0 replies; 370+ messages in thread From: Clément Pit-Claudel @ 2021-07-22 4:40 UTC (permalink / raw) To: emacs-devel On 7/21/21 5:54 PM, Stephen Leake wrote: > It should only freeze write operations on that buffer, so marking it > read-only while waiting for the parse results might be best. Yes, I expect that would be much better than what we have. Thanks for your work on wisi, by the way! ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-17 2:23 ` Clément Pit-Claudel 2021-07-17 3:12 ` Yuan Fu 2021-07-17 7:16 ` Eli Zaretskii @ 2021-07-17 17:30 ` Stefan Monnier 2021-07-17 17:54 ` Eli Zaretskii ` (2 more replies) 2 siblings, 3 replies; 370+ messages in thread From: Stefan Monnier @ 2021-07-17 17:30 UTC (permalink / raw) To: Clément Pit-Claudel; +Cc: emacs-devel In your benchmark, you give numbers for: - initial full-text parse (a bit above 1MB/s) - cost of update-without-reparse but I think it would be nice to see the cost of the reparse after those updates (should be much faster than the initial parse). Clément said: > I have no idea if it makes sense, but: does the initial parse need to be > synchronous, or could you instead run the parsing in one thread, and the > rest of Emacs in another? (I'm talking about concurrent execution, not > cooperative threading). If we copy the buffer's content to a freshly malloc'ed area before passing that to TS, then there should be no problem running TS in a separate concurrent thread, indeed. Eli said: > Why do you update the entire parser list for every modification? > This comment: If having multiple parsers in a single buffer is a not-uncommon case, then indeed we'll need to do better, but if we assume this is an anomalous situation, then Yuan's code is optimal ;-) > Yes, I think we should only ask TS to parse what we need, not more. We'll need to experiment with that. Using an approach like `syntax-ppss` where we only parse up to some high-watermark might be a good approach, but it's also possible that it will work poorly: if TS assumes it works on the whole buffer, then it will see the truncated text as a syntax error and while it is supposed to handle syntax errors nicely it may still lead to suboptimal behavior when parts of perfectly valid code are misparsed because the parser was not allowed to see the closing braces that make it "perfectly valid". Stefan ^ permalink raw reply [flat|nested] 370+ messages in thread
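[Editorial sketch] The "copy the buffer's content to a freshly malloc'ed area" step Stefan mentions could be sketched as below. The `struct gap_buffer` here is a simplified, hypothetical stand-in, not Emacs's actual buffer text structure, and `snapshot_text` is an invented name; the point is only that the text on both sides of the gap is flattened into one contiguous block that a parser thread could read without touching the live buffer.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical gap buffer: the text lives in data[0, gap_start) and
   data[gap_end, size); the bytes in between are the unused gap.
   This is NOT Emacs's real representation, only an illustration.  */
struct gap_buffer
{
  char *data;
  size_t gap_start;
  size_t gap_end;
  size_t size;
};

/* Copy the text of B, without the gap, into a fresh NUL-terminated
   block and store its length in *LEN_OUT.  The caller frees the
   result.  Returns NULL if allocation fails.  */
char *
snapshot_text (const struct gap_buffer *b, size_t *len_out)
{
  size_t before = b->gap_start;          /* bytes before the gap */
  size_t after = b->size - b->gap_end;   /* bytes after the gap */
  char *copy = malloc (before + after + 1);

  if (copy == NULL)
    return NULL;

  memcpy (copy, b->data, before);
  memcpy (copy + before, b->data + b->gap_end, after);
  copy[before + after] = '\0';
  *len_out = before + after;
  return copy;
}
```

The cost is one pass over the text plus an allocation of the same size, which is the overhead debated later in the thread.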
* Re: How to add pseudo vector types 2021-07-17 17:30 ` Stefan Monnier @ 2021-07-17 17:54 ` Eli Zaretskii 2021-07-24 14:08 ` Stefan Monnier 2021-07-19 15:16 ` Yuan Fu 2021-07-20 16:32 ` Stephen Leake 2 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-07-17 17:54 UTC (permalink / raw) To: Stefan Monnier; +Cc: cpitclaudel, emacs-devel > From: Stefan Monnier <monnier@iro.umontreal.ca> > Cc: emacs-devel@gnu.org > Date: Sat, 17 Jul 2021 13:30:40 -0400 > > Clément said: > > I have no idea if it makes sense, but: does the initial parse need to be > > synchronous, or could you instead run the parsing in one thread, and the > > rest of Emacs in another? (I'm talking about concurrent execution, not > > cooperative threading). > > If we copy the buffer's content to a freshly malloc area before passing > that to TS, then there should be no problem running TS in a separate > concurrent thread, indeed. Making a copy of the buffer is a non-starter from where I stand. It doesn't scale, for starters. I don't see any reason to go to such a complex design at this early stage. > Eli said: > > Why do you update the entire parser list for every modification? > > This comment: > > If having multiple parsers in a single buffer is a not-uncommon case, > then indeed we'll need to do better, but if we assume this is an > anomalous situation, then Yuan's code is optimal ;-) > > > Yes, I think we should only ask TS to parse what we need, not more. > > We'll need to experiment with that. We can experiment, but I think the basic design should be clean and reasonable from the get-go. 
> Using an approach like `syntax-ppss` where we only parse up to some > high-watermark might be a good approach, but it's also possible that > it will work poorly: if TS assumes it works on the whole buffer, > then it will see the truncated text as a syntax error and while it > is supposed to handle syntax errors nicely it may still lead to > suboptimal behavior when parts of perfectly valid code are misparsed > because the parser was not allowed to see the closing braces that > make it "perfectly valid". TS must be able to handle these situations well enough, because they happen during editing all the time. I wouldn't worry about that, definitely not at this stage. Different uses of the parse results will need to pass different chunks of buffer text, and that is okay. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-17 17:54 ` Eli Zaretskii @ 2021-07-24 14:08 ` Stefan Monnier 2021-07-24 14:32 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Stefan Monnier @ 2021-07-24 14:08 UTC (permalink / raw) To: Eli Zaretskii; +Cc: cpitclaudel, emacs-devel >> If we copy the buffer's content to a freshly malloc area before passing >> that to TS, then there should be no problem running TS in a separate >> concurrent thread, indeed. > Making a copy of the buffer is a non-starter from where I stand. It > doesn't scale, for starters. I don't see any reason to go to such a > complex design at this early stage. I see absolutely no problem with scaling in making a copy: the extra memory and CPU time taken by the copy will be a constant factor which I don't expect to go much beyond 10%, which doesn't threaten scaling and seems perfectly acceptable in return for being able to perform the parse concurrently. I'm not sure we'll want to do that, but I see no reason to consider it a non-starter. [ BTW, it's not clear to me if an update needs to be able to read the whole buffer or if it only needs access to the "update description". ] Stefan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-24 14:08 ` Stefan Monnier @ 2021-07-24 14:32 ` Eli Zaretskii 2021-07-24 15:10 ` Stefan Monnier 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-07-24 14:32 UTC (permalink / raw) To: Stefan Monnier; +Cc: cpitclaudel, emacs-devel > From: Stefan Monnier <monnier@iro.umontreal.ca> > Cc: cpitclaudel@gmail.com, emacs-devel@gnu.org > Date: Sat, 24 Jul 2021 10:08:58 -0400 > > >> If we copy the buffer's content to a freshly malloc area before passing > >> that to TS, then there should be no problem running TS in a separate > >> concurrent thread, indeed. > > Making a copy of the buffer is a non-starter from where I stand. It > > doesn't scale, for starters. I don't see any reason to go to such a > > complex design at this early stage. > > I see absolutely no problem with scaling in making a copy: the extra > memory and CPU time taken by the copy will be a constant factor which > I don't expect to go much beyond 10% 10% of what? It will be 100% of all the buffers that need parsing. > I'm not sure we'll want to do that, but I see no reason to consider it > a non-starter. It's a bad start, okay? Anyway, it looks like nothing like that will be necessary, fortunately. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-24 14:32 ` Eli Zaretskii @ 2021-07-24 15:10 ` Stefan Monnier 2021-07-24 15:51 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Stefan Monnier @ 2021-07-24 15:10 UTC (permalink / raw) To: Eli Zaretskii; +Cc: cpitclaudel, emacs-devel >> I see absolutely no problem with scaling in making a copy: the extra >> memory and CPU time taken by the copy will be a constant factor which >> I don't expect to go much beyond 10% > 10% of what? It will be 100% of all the buffers that need parsing. 10% of the memory used by that buffer, since TS's data structure eats up about 10x the size of the buffer's text. Given the memory needs of TS we may decide to have a `tree-sitter-maximum-size` config to disable TS on overly large buffers (just like font-lock has such a setting, since when used without jit-lock, font-lock also can easily end up using more memory than the buffer's text). Stefan ^ permalink raw reply [flat|nested] 370+ messages in thread
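[Editorial sketch] The "10%" versus "100%" disagreement comes down to which baseline the copy is measured against. The small function below makes that arithmetic explicit; `copy_overhead` and its parameters are hypothetical names, and the 10x tree factor is Stefan's estimate from the message above, not a measured constant.

```c
/* Extra memory needed for one copy of the buffer text, expressed as
   a fraction of what the buffer already costs: the text itself plus
   a parse tree assumed to be TREE_FACTOR times the text size.
   Returns 0 for an empty buffer.  */
double
copy_overhead (double text_bytes, double tree_factor)
{
  double existing = text_bytes + tree_factor * text_bytes;

  if (existing <= 0.0)
    return 0.0;
  return text_bytes / existing;
}
```

With a tree factor of 10 the copy adds 1/11 of the existing footprint, a little over 9%, which matches Stefan's "not much beyond 10%". With a tree factor of 0, that is, measuring against the text alone, the same function gives 1.0, which is the 100% Eli has in mind.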
* Re: How to add pseudo vector types 2021-07-24 15:10 ` Stefan Monnier @ 2021-07-24 15:51 ` Eli Zaretskii 0 siblings, 0 replies; 370+ messages in thread From: Eli Zaretskii @ 2021-07-24 15:51 UTC (permalink / raw) To: Stefan Monnier; +Cc: cpitclaudel, emacs-devel > From: Stefan Monnier <monnier@iro.umontreal.ca> > Cc: cpitclaudel@gmail.com, emacs-devel@gnu.org > Date: Sat, 24 Jul 2021 11:10:26 -0400 > > >> I see absolutely no problem with scaling in making a copy: the extra > >> memory and CPU time taken by the copy will be a constant factor which > >> I don't expect to go much beyond 10% > > 10% of what? It will be 100% of all the buffers that need parsing. > > 10% of the memory used by that buffer, since TS's data structure eats up > about 10x the size of the buffer's text. That's still a lot of wasted storage, let alone time. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-17 17:30 ` Stefan Monnier 2021-07-17 17:54 ` Eli Zaretskii @ 2021-07-19 15:16 ` Yuan Fu 2021-07-22 3:10 ` Yuan Fu 2021-07-20 16:32 ` Stephen Leake 2 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-07-19 15:16 UTC (permalink / raw) To: Stefan Monnier; +Cc: Clément Pit-Claudel, emacs-devel [-- Attachment #1: Type: text/plain, Size: 1080 bytes --] > On Jul 17, 2021, at 1:30 PM, Stefan Monnier <monnier@iro.umontreal.ca> wrote: > > In your benchmark, you give numbers for: > - initial full-text parse (a bit above 1MB/s) > - cost of update-without-reparse > > but I think it would be nice to see the cost of the reparse after > those updates (should be much faster than the initial parse). I have done some more benchmarks. Initially I thought tree-sitter doesn’t scale, because re-parsing my JSON file is unexpectedly slow, but then I retried with xdisp.c and tree-sitter's C parser, and that is really fast and matches my expectation of tree-sitter. So from now on I’ll use xdisp.c and the C parser for benchmarking. I guess the JSON parser is simply badly written? I benchmarked with a simple C program. The programs are in main-c.c and main-json.c, and the shell output of the measurements is in benchmark.3.txt.
JSON: Initial parse takes 1.2s, re-parse (with no change) takes 0.7s, uses 307MB memory
C: Initial parse takes 0.14s, re-parse (with no change) takes 0.009s, uses 20MB memory

Yuan

[-- Attachment #2: benchmark.3.txt --]
[-- Type: text/plain, Size: 2875 bytes --]

On benchmark.2.json (1.6M)

One full parse: 1.2s

________________________________________________________
Executed in    1.30 secs      fish           external
   usr time 1210.81 millis  142.00 micros 1210.67 millis
   sys time   87.40 millis  756.00 micros   86.65 millis

One full parse and a re-parse:

________________________________________________________
Executed in    2.40 secs      fish           external
   usr time    1.95 secs    154.00 micros    1.95 secs
   sys time    0.15 secs    763.00 micros    0.15 secs

Re-parse takes 1.95 - 1.21 = 0.74s

Memory usage of full-parse + re-parse:

        2.17 real         2.00 user         0.16 sys
   307269632  maximum resident set size
           0  average shared memory size
           0  average unshared data size
           0  average unshared stack size
       75035  page reclaims
           0  page faults
           0  swaps
           0  block input operations
           0  block output operations
           0  messages sent
           0  messages received
           0  signals received
           0  voluntary context switches
         463  involuntary context switches
 14674957821  instructions retired
  7838514409  cycles elapsed
   306745344  peak memory footprint

307MB for two trees that "shares internal structure".

On xdisp.c (1.2M)

One full parse: 0.139s

________________________________________________________
Executed in  478.23 millis    fish           external
   usr time  139.69 millis  134.00 micros  139.55 millis
   sys time    8.05 millis  829.00 micros    7.22 millis

Full parse and re-parse:

________________________________________________________
Executed in  456.58 millis    fish           external
   usr time  148.23 millis  153.00 micros  148.08 millis
   sys time    9.08 millis  791.00 micros    8.29 millis

148 ms - 139 ms = 9 ms = 0.009s

Memory usage of full-parse + re-parse:

        0.16 real         0.15 user         0.00 sys
    20131840  maximum resident set size
           0  average shared memory size
           0  average unshared data size
           0  average unshared stack size
        4932  page reclaims
           0  page faults
           0  swaps
           0  block input operations
           0  block output operations
           0  messages sent
           0  messages received
           0  signals received
           0  voluntary context switches
          28  involuntary context switches
  1070525817  instructions retired
   581557699  cycles elapsed
    19271680  peak memory footprint

20MB

[-- Attachment #3: main-c.c --]
[-- Type: application/octet-stream, Size: 1180 bytes --]

#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <tree_sitter/api.h>

TSLanguage *tree_sitter_c();

struct buffer {
  char *buf;
  long len;
};

/* TSInput read callback: hand tree-sitter everything from BYTE_INDEX
   to the end of the in-memory file as a single chunk. */
const char *read_file(void *payload, uint32_t byte_index,
                      TSPoint position, uint32_t *bytes_read) {
  long len = ((struct buffer *) payload)->len;
  if (byte_index >= len) {
    *bytes_read = 0;
    return (char *) "";
  } else {
    *bytes_read = len - byte_index;
    return (char *) (((struct buffer *) payload)->buf) + byte_index;
  }
}

int main() {
  TSParser *parser = ts_parser_new();
  ts_parser_set_language(parser, tree_sitter_c());

  /* Copy the file into BUFFER. */
  FILE *file = fopen("xdisp.c", "rb");
  fseek(file, 0, SEEK_END);
  long length = ftell (file);
  fseek(file, 0, SEEK_SET);
  char *buffer = malloc (length);
  fread(buffer, 1, length, file);
  fclose (file);
  struct buffer buf = {buffer, length};

  /* Full parse, then a re-parse that reuses the old tree. */
  TSInput input = {&buf, read_file, TSInputEncodingUTF8};
  TSTree *tree = ts_parser_parse(parser, NULL, input);
  TSTree *new_tree = ts_parser_parse(parser, tree, input);

  free(buffer);
  ts_tree_delete(tree);
  ts_tree_delete(new_tree);
  ts_parser_delete(parser);
  return 0;
}

[-- Attachment #4: main-json.c --]
[-- Type: application/octet-stream, Size: 1195 bytes --]

#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <tree_sitter/api.h>

TSLanguage *tree_sitter_json();

struct buffer {
  char *buf;
  long len;
};

/* TSInput read callback: hand tree-sitter everything from BYTE_INDEX
   to the end of the in-memory file as a single chunk. */
const char *read_file(void *payload, uint32_t byte_index,
                      TSPoint position, uint32_t *bytes_read) {
  long len = ((struct buffer *) payload)->len;
  if (byte_index >= len) {
    *bytes_read = 0;
    return (char *) "";
  } else {
    *bytes_read = len - byte_index;
    return (char *) (((struct buffer *) payload)->buf) + byte_index;
  }
}

int main() {
  TSParser *parser = ts_parser_new();
  ts_parser_set_language(parser, tree_sitter_json());

  /* Copy the file into BUFFER. */
  FILE *file = fopen("benchmark.3.json", "rb");
  fseek(file, 0, SEEK_END);
  long length = ftell (file);
  fseek(file, 0, SEEK_SET);
  char *buffer = malloc (length);
  fread(buffer, 1, length, file);
  fclose (file);
  struct buffer buf = {buffer, length};

  /* Full parse, then a re-parse that reuses the old tree. */
  TSInput input = {&buf, read_file, TSInputEncodingUTF8};
  TSTree *tree = ts_parser_parse(parser, NULL, input);
  TSTree *new_tree = ts_parser_parse(parser, tree, input);

  free(buffer);
  ts_tree_delete(tree);
  ts_tree_delete(new_tree);
  ts_parser_delete(parser);
  return 0;
}

^ permalink raw reply [flat|nested] 370+ messages in thread
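[Editorial sketch] The re-parse figures above are obtained by subtracting the user time of two whole-process runs, which also includes file reading and process startup. One alternative, offered here only as an assumption about how the measurement could be tightened, is to time each call in-process with a monotonic clock. `elapsed_seconds` and `time_call` are hypothetical helpers, not part of tree-sitter or of Yuan's programs.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime */
#include <time.h>

/* Seconds elapsed from START to END.  */
double
elapsed_seconds (struct timespec start, struct timespec end)
{
  return (double) (end.tv_sec - start.tv_sec)
         + (double) (end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Run WORK (ARG) once and return how long it took, in seconds,
   according to the monotonic clock.  */
double
time_call (void (*work) (void *), void *arg)
{
  struct timespec t0, t1;

  clock_gettime (CLOCK_MONOTONIC, &t0);
  work (arg);
  clock_gettime (CLOCK_MONOTONIC, &t1);
  return elapsed_seconds (t0, t1);
}
```

In a program like main-c.c, the first and second `ts_parser_parse` calls could each be wrapped in a small callback and passed to `time_call`, giving the initial-parse and re-parse times directly instead of by subtraction.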
* Re: How to add pseudo vector types 2021-07-19 15:16 ` Yuan Fu @ 2021-07-22 3:10 ` Yuan Fu 2021-07-22 8:23 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-07-22 3:10 UTC (permalink / raw) To: Stefan Monnier; +Cc: Clément Pit-Claudel, emacs-devel [-- Attachment #1: Type: text/plain, Size: 579 bytes --] Here is another patch. No big progress since I’m busy moving this week. In this patch I changed from using change hooks to directly updating the trees in edit functions. I also added some node api and tests. Should I keep posting patches, or should I create a branch in /scratch? If the latter, how do I do it? I’m aware of the ongoing enlightening discussion on potential optimizations for tree-sitter. My plan is to first complete the api and implement some minimal structural editing/font-lock features, then we can concretely measure what needs to improve. Yuan [-- Attachment #2: ts.3.patch --] [-- Type: application/octet-stream, Size: 24276 bytes --] From fd8ad36fe5ea3b9b12e80879b7434b8bc67b53db Mon Sep 17 00:00:00 2001 From: Yuan Fu <casouri@gmail.com> Date: Wed, 21 Jul 2021 22:43:07 -0400 Subject: [PATCH] checkpoint 3 - change_hook -> directly in edit functions - add a need_reparse field in Lisp_TS_Parser - more node api - tests --- src/insdel.c | 43 ++++-- src/tree_sitter.c | 274 ++++++++++++++++++++++++++++------ src/tree_sitter.h | 18 ++- test/src/tree-sitter-tests.el | 106 +++++++++++++ 4 files changed, 377 insertions(+), 64 deletions(-) create mode 100644 test/src/tree-sitter-tests.el diff --git a/src/insdel.c b/src/insdel.c index 3c1e13d38b..b313c50cda 100644 --- a/src/insdel.c +++ b/src/insdel.c @@ -947,6 +947,10 @@ insert_1_both (const char *string, adjust_point (nchars, nbytes); check_markers (); + +#ifdef HAVE_TREE_SITTER + ts_record_change (PT_BYTE - nbytes, PT_BYTE - nbytes, PT_BYTE); +#endif } \f /* Insert the part of the text of STRING, a Lisp object assumed to be @@ -1078,6 +1082,10 @@ insert_from_string_1 
(Lisp_Object string, ptrdiff_t pos, ptrdiff_t pos_byte, adjust_point (nchars, outgoing_nbytes); check_markers (); + +#ifdef HAVE_TREE_SITTER + ts_record_change (PT_BYTE - nbytes, PT_BYTE - nbytes, PT_BYTE); +#endif } \f /* Insert a sequence of NCHARS chars which occupy NBYTES bytes @@ -1145,6 +1153,10 @@ insert_from_gap (ptrdiff_t nchars, ptrdiff_t nbytes, bool text_at_gap_tail) adjust_point (nchars, nbytes); check_markers (); + +#ifdef HAVE_TREE_SITTER + ts_record_change (PT_BYTE - nbytes, PT_BYTE - nbytes, nbytes); +#endif } \f /* Insert text from BUF, NCHARS characters starting at CHARPOS, into the @@ -1292,6 +1304,11 @@ insert_from_buffer_1 (struct buffer *buf, graft_intervals_into_buffer (intervals, PT, nchars, current_buffer, inherit); adjust_point (nchars, outgoing_nbytes); + +#ifdef HAVE_TREE_SITTER + ts_record_change (PT_BYTE - outgoing_nbytes, + PT_BYTE - outgoing_nbytes, PT_BYTE); +#endif } \f /* Record undo information and adjust markers and position keepers for @@ -1556,6 +1573,11 @@ replace_range (ptrdiff_t from, ptrdiff_t to, Lisp_Object new, if (adjust_match_data) update_search_regs (from, to, from + SCHARS (new)); + +#ifdef HAVE_TREE_SITTER + ts_record_change (from_byte, to_byte, GPT_BYTE); +#endif + signal_after_change (from, nchars_del, GPT - from); update_compositions (from, GPT, CHECK_BORDER); } @@ -1683,6 +1705,11 @@ replace_range_2 (ptrdiff_t from, ptrdiff_t from_byte, modiff_incr (&MODIFF); CHARS_MODIFF = MODIFF; + +#ifdef HAVE_TREE_SITTER + ts_record_change (from_byte, to_byte, from_byte + insbytes); +#endif + } \f /* Delete characters in current buffer @@ -1893,6 +1920,10 @@ del_range_2 (ptrdiff_t from, ptrdiff_t from_byte, evaporate_overlays (from); +#ifdef HAVE_TREE_SITTER + ts_record_change (from_byte, to_byte, from_byte); +#endif + return deletion; } @@ -2156,11 +2187,6 @@ signal_before_change (ptrdiff_t start_int, ptrdiff_t end_int, run_hook (Qfirst_change_hook); } -#ifdef HAVE_TREE_SITTER - /* FIXME: Is this the best place? 
*/ - ts_before_change (start_int, end_int); -#endif - /* Now run the before-change-functions if any. */ if (!NILP (Vbefore_change_functions)) { @@ -2214,13 +2240,6 @@ signal_after_change (ptrdiff_t charpos, ptrdiff_t lendel, ptrdiff_t lenins) if (inhibit_modification_hooks) return; -#ifdef HAVE_TREE_SITTER - /* We disrespect combine-after-change, because if we don't record - this change, the information that we need (the end byte position - of the change) will be lost. */ - ts_after_change (charpos, lendel, lenins); -#endif - /* If we are deferring calls to the after-change functions and there are no before-change functions, just record the args that we were going to use. */ diff --git a/src/tree_sitter.c b/src/tree_sitter.c index 7d1225161c..a6a8912c84 100644 --- a/src/tree_sitter.c +++ b/src/tree_sitter.c @@ -32,49 +32,52 @@ Copyright (C) 2021 Free Software Foundation, Inc. #include "coding.h" #include "tree_sitter.h" -/* parser.h defines a macro ADVANCE that conflicts with alloc.c. */ +/* parser.h defines a macro ADVANCE that conflicts with alloc.c. */ #include <tree_sitter/parser.h> -/* Record the byte position of the end of the (to-be) changed text. -We have to record it now, because by the time we get to after-change -hook, the _byte_ position of the end is lost. */ -void -ts_before_change (ptrdiff_t start_int, ptrdiff_t end_int) +DEFUN ("tree-sitter-parser-p", + Ftree_sitter_parser_p, Stree_sitter_parser_p, 1, 1, 0, + doc: /* Return t if OBJECT is a tree-sitter parser. */) + (Lisp_Object object) { - /* Iterate through each parser in 'tree-sitter-parser-list' and - record the byte position. There could be better ways to record - it than storing the same position in every parser, but this is - the most fool-proof way, and I expect a buffer to have only one - parser most of the time anyway. 
*/ - ptrdiff_t beg_byte = CHAR_TO_BYTE (start_int); - ptrdiff_t old_end_byte = CHAR_TO_BYTE (end_int); - Lisp_Object parser_list = Fsymbol_value (Qtree_sitter_parser_list); - while (!NILP (parser_list)) - { - Lisp_Object lisp_parser = Fcar (parser_list); - XTS_PARSER (lisp_parser)->edit.start_byte = beg_byte; - XTS_PARSER (lisp_parser)->edit.old_end_byte = old_end_byte; - parser_list = Fcdr (parser_list); - } + if (TS_PARSERP (object)) + return Qt; + else + return Qnil; +} + +DEFUN ("tree-sitter-node-p", + Ftree_sitter_node_p, Stree_sitter_node_p, 1, 1, 0, + doc: /* Return t if OBJECT is a tree-sitter node. */) + (Lisp_Object object) +{ + if (TS_NODEP (object)) + return Qt; + else + return Qnil; } /* Update each parser's tree after the user made an edit. This function does not parse the buffer and only updates the tree. (So it should be very fast.) */ void -ts_after_change (ptrdiff_t charpos, ptrdiff_t lendel, ptrdiff_t lenins) +ts_record_change (ptrdiff_t start_byte, ptrdiff_t old_end_byte, + ptrdiff_t new_end_byte) { - ptrdiff_t new_end_byte = CHAR_TO_BYTE (charpos + lenins); Lisp_Object parser_list = Fsymbol_value (Qtree_sitter_parser_list); + TSPoint dummy_point = {0, 0}; + TSInputEdit edit = {start_byte, old_end_byte, new_end_byte, + dummy_point, dummy_point, dummy_point}; while (!NILP (parser_list)) { Lisp_Object lisp_parser = Fcar (parser_list); TSTree *tree = XTS_PARSER (lisp_parser)->tree; - XTS_PARSER (lisp_parser)->edit.new_end_byte = new_end_byte; if (tree != NULL) - ts_tree_edit (tree, &XTS_PARSER (lisp_parser)->edit); + ts_tree_edit (tree, &edit); + XTS_PARSER (lisp_parser)->need_reparse = true; parser_list = Fcdr (parser_list); } + } /* Parse the buffer. We don't parse until we have to. 
When we have @@ -82,11 +85,15 @@ ts_after_change (ptrdiff_t charpos, ptrdiff_t lendel, ptrdiff_t lenins) void ts_ensure_parsed (Lisp_Object parser) { + if (!XTS_PARSER (parser)->need_reparse) + return; TSParser *ts_parser = XTS_PARSER (parser)->parser; TSTree *tree = XTS_PARSER(parser)->tree; TSInput input = XTS_PARSER (parser)->input; TSTree *new_tree = ts_parser_parse(ts_parser, tree, input); + ts_tree_delete (tree); XTS_PARSER (parser)->tree = new_tree; + XTS_PARSER (parser)->need_reparse = false; } /* This is the read function provided to tree-sitter to read from a @@ -96,33 +103,30 @@ ts_ensure_parsed (Lisp_Object parser) ts_read_buffer (void *buffer, uint32_t byte_index, TSPoint position, uint32_t *bytes_read) { - if (! BUFFER_LIVE_P ((struct buffer *) buffer)) + if (!BUFFER_LIVE_P ((struct buffer *) buffer)) error ("BUFFER is not live"); ptrdiff_t byte_pos = byte_index + 1; - // FIXME: Add some boundary checks? - /* I believe we can get away with only setting current-buffer - and not actually switching to it, like what we did in - 'make_gap_1'. */ - struct buffer *old_buffer = current_buffer; - current_buffer = (struct buffer *) buffer; - - /* Read one character. */ + /* Read one character. Tree-sitter wants us to set bytes_read to 0 + if it reads to the end of buffer. It doesn't say what it wants + for the return value in that case, so we just give it an empty + string. */ char *beg; int len; - if (byte_pos >= Z_BYTE) + // TODO BUF_ZV_BYTE? 
+ if (byte_pos >= BUF_Z_BYTE ((struct buffer *) buffer)) { beg = ""; len = 0; } else { - beg = (char *) BYTE_POS_ADDR (byte_pos); + beg = (char *) BUF_BYTE_ADDRESS (buffer, byte_pos); len = next_char_len(byte_pos); } *bytes_read = (uint32_t) len; - current_buffer = old_buffer; + return beg; } @@ -137,9 +141,7 @@ make_ts_parser (struct buffer *buffer, TSParser *parser, TSTree *tree) lisp_parser->tree = tree; TSInput input = {buffer, ts_read_buffer, TSInputEncodingUTF8}; lisp_parser->input = input; - TSPoint dummy_point = {0, 0}; - TSInputEdit edit = {0, 0, 0, dummy_point, dummy_point, dummy_point}; - lisp_parser->edit = edit; + lisp_parser->need_reparse = true; return make_lisp_ptr (lisp_parser, Lisp_Vectorlike); } @@ -192,6 +194,7 @@ DEFUN ("tree-sitter-parser-root-node", doc: /* Return the root node of PARSER. */) (Lisp_Object parser) { + CHECK_TS_PARSER (parser); ts_ensure_parsed(parser); TSNode root_node = ts_tree_root_node (XTS_PARSER (parser)->tree); return make_ts_node (parser, root_node); @@ -229,11 +232,29 @@ DEFUN ("tree-sitter-parse", Ftree_sitter_parse, Stree_sitter_parse, return lisp_node; } +/* Below this point are uninteresting mechanical translations of + tree-sitter API. */ + +/* Node functions. */ + +DEFUN ("tree-sitter-node-type", + Ftree_sitter_node_type, Stree_sitter_node_type, 1, 1, 0, + doc: /* Return the NODE's type as a symbol. */) + (Lisp_Object node) +{ + CHECK_TS_NODE (node); + TSNode ts_node = XTS_NODE (node)->node; + const char *type = ts_node_type(ts_node); + return intern_c_string (type); +} + + DEFUN ("tree-sitter-node-string", Ftree_sitter_node_string, Stree_sitter_node_string, 1, 1, 0, doc: /* Return the string representation of NODE. 
*/) (Lisp_Object node) { + CHECK_TS_NODE (node); TSNode ts_node = XTS_NODE (node)->node; char *string = ts_node_string(ts_node); return make_string(string, strlen (string)); @@ -242,29 +263,125 @@ DEFUN ("tree-sitter-node-string", DEFUN ("tree-sitter-node-parent", Ftree_sitter_node_parent, Stree_sitter_node_parent, 1, 1, 0, doc: /* Return the immediate parent of NODE. -Return nil if we couldn't find any. */) +Return nil if there isn't any. */) (Lisp_Object node) { + CHECK_TS_NODE (node); TSNode ts_node = XTS_NODE (node)->node; - TSNode parent = ts_node_parent(ts_node); + TSNode parent = ts_node_parent (ts_node); if (ts_node_is_null(parent)) return Qnil; - return make_ts_node(XTS_NODE (node)->parser, parent); + return make_ts_node (XTS_NODE (node)->parser, parent); } DEFUN ("tree-sitter-node-child", - Ftree_sitter_node_child, Stree_sitter_node_child, 2, 2, 0, + Ftree_sitter_node_child, Stree_sitter_node_child, 2, 3, 0, doc: /* Return the Nth child of NODE. -Return nil if we couldn't find any. */) +Return nil if there isn't any. If NAMED is non-nil, look for named +child only. NAMED defaults to nil. */) + (Lisp_Object node, Lisp_Object n, Lisp_Object named) +{ + CHECK_TS_NODE (node); + CHECK_INTEGER (n); + EMACS_INT idx = XFIXNUM (n); + TSNode ts_node = XTS_NODE (node)->node; + TSNode child; + if (NILP (named)) + child = ts_node_child (ts_node, (uint32_t) idx); + else + child = ts_node_named_child (ts_node, (uint32_t) idx); + + if (ts_node_is_null(child)) + return Qnil; + + return make_ts_node(XTS_NODE (node)->parser, child); +} + +DEFUN ("tree-sitter-node-check", + Ftree_sitter_node_check, Stree_sitter_node_check, 2, 2, 0, + doc: /* Return non-nil if NODE is in condition COND, nil otherwise. + +COND could be 'named, 'missing, 'extra, 'has-error. Named nodes +correspond to named rules in the grammar, whereas "anonymous" nodes +correspond to string literals in the grammar. 
+ +Missing nodes are inserted by the parser in order to recover from +certain kinds of syntax errors, i.e., should be there but not there. + +Extra nodes represent things like comments, which are not required the +grammar, but can appear anywhere. + +A node "has error" if itself is a syntax error or contains any syntax +errors. */) + (Lisp_Object node, Lisp_Object cond) +{ + CHECK_TS_NODE (node); + CHECK_SYMBOL (cond); + TSNode ts_node = XTS_NODE (node)->node; + bool result; + if (EQ (cond, Qnamed)) + result = ts_node_is_named (ts_node); + else if (EQ (cond, Qmissing)) + result = ts_node_is_missing (ts_node); + else if (EQ (cond, Qextra)) + result = ts_node_is_extra (ts_node); + else if (EQ (cond, Qhas_error)) + result = ts_node_has_error (ts_node); + else + signal_error ("Expecting one of four symbols, see docstring", cond); + return result ? Qt : Qnil; +} + +DEFUN ("tree-sitter-node-field-name-for-child", + Ftree_sitter_node_field_name_for_child, + Stree_sitter_node_field_name_for_child, 2, 2, 0, + doc: /* Return the field name of the Nth child of NODE. +Return nil if there isn't any child or no field is found. */) (Lisp_Object node, Lisp_Object n) { CHECK_INTEGER (n); EMACS_INT idx = XFIXNUM (n); TSNode ts_node = XTS_NODE (node)->node; - // FIXME: Is this cast ok? - TSNode child = ts_node_child(ts_node, (uint32_t) idx); + const char *name + = ts_node_field_name_for_child (ts_node, (uint32_t) idx); + + if (name == NULL) + return Qnil; + + return make_string (name, strlen (name)); +} + +DEFUN ("tree-sitter-node-child-count", + Ftree_sitter_node_child_count, + Stree_sitter_node_child_count, 1, 2, 0, + doc: /* Return the number of children of NODE. +If NAMED is non-nil, count named child only. NAMED defaults to +nil. 
*/) + (Lisp_Object node, Lisp_Object named) +{ + TSNode ts_node = XTS_NODE (node)->node; + uint32_t count; + if (NILP (named)) + count = ts_node_child_count (ts_node); + else + count = ts_node_named_child_count (ts_node); + return make_fixnum (count); +} + +DEFUN ("tree-sitter-node-child-by-field-name", + Ftree_sitter_node_child_by_field_name, + Stree_sitter_node_child_by_field_name, 2, 2, 0, + doc: /* Return the child of NODE with field name NAME. +Return nil if there isn't any. */) + (Lisp_Object node, Lisp_Object name) +{ + CHECK_STRING (name); + char *name_str = SSDATA (name); + TSNode ts_node = XTS_NODE (node)->node; + TSNode child + = ts_node_child_by_field_name (ts_node, name_str, strlen (name_str)); if (ts_node_is_null(child)) return Qnil; @@ -272,10 +389,62 @@ DEFUN ("tree-sitter-node-child", return make_ts_node(XTS_NODE (node)->parser, child); } +DEFUN ("tree-sitter-node-next-sibling", + Ftree_sitter_node_next_sibling, + Stree_sitter_node_next_sibling, 1, 2, 0, + doc: /* Return the next sibling of NODE. +Return nil if there isn't any. If NAMED is non-nil, look for named +child only. NAMED defaults to nil. */) + (Lisp_Object node, Lisp_Object named) +{ + TSNode ts_node = XTS_NODE (node)->node; + TSNode sibling; + if (NILP (named)) + sibling = ts_node_next_sibling (ts_node); + else + sibling = ts_node_next_named_sibling (ts_node); + + if (ts_node_is_null(sibling)) + return Qnil; + + return make_ts_node(XTS_NODE (node)->parser, sibling); +} + +DEFUN ("tree-sitter-node-prev-sibling", + Ftree_sitter_node_prev_sibling, + Stree_sitter_node_prev_sibling, 1, 2, 0, + doc: /* Return the previous sibling of NODE. +Return nil if there isn't any. If NAMED is non-nil, look for named +child only. NAMED defaults to nil. 
*/) + (Lisp_Object node, Lisp_Object named) +{ + TSNode ts_node = XTS_NODE (node)->node; + TSNode sibling; + + if (NILP (named)) + sibling = ts_node_prev_sibling (ts_node); + else + sibling = ts_node_prev_named_sibling (ts_node); + + if (ts_node_is_null(sibling)) + return Qnil; + + return make_ts_node(XTS_NODE (node)->parser, sibling); +} + +/* Query functions */ + /* Initialize the tree-sitter routines. */ void syms_of_tree_sitter (void) { + DEFSYM (Qtree_sitter_parser_p, "tree-sitter-parser-p"); + DEFSYM (Qtree_sitter_node_p, "tree-sitter-node-p"); + DEFSYM (Qnamed, "named"); + DEFSYM (Qmissing, "missing"); + DEFSYM (Qextra, "extra"); + DEFSYM (Qhas_error, "has-error"); + DEFSYM (Qtree_sitter_parser_list, "tree-sitter-parser-list"); DEFVAR_LISP ("ts-parser-list", Vtree_sitter_parser_list, doc: /* A list of tree-sitter parsers. @@ -284,11 +453,20 @@ syms_of_tree_sitter (void) Vtree_sitter_parser_list = Qnil; Fmake_variable_buffer_local (Qtree_sitter_parser_list); - + defsubr (&Stree_sitter_parser_p); + defsubr (&Stree_sitter_node_p); defsubr (&Stree_sitter_create_parser); defsubr (&Stree_sitter_parser_root_node); defsubr (&Stree_sitter_parse); + + defsubr (&Stree_sitter_node_type); defsubr (&Stree_sitter_node_string); defsubr (&Stree_sitter_node_parent); defsubr (&Stree_sitter_node_child); + defsubr (&Stree_sitter_node_check); + defsubr (&Stree_sitter_node_field_name_for_child); + defsubr (&Stree_sitter_node_child_count); + defsubr (&Stree_sitter_node_child_by_field_name); + defsubr (&Stree_sitter_node_next_sibling); + defsubr (&Stree_sitter_node_prev_sibling); } diff --git a/src/tree_sitter.h b/src/tree_sitter.h index 0606f336cc..a7e2a2d670 100644 --- a/src/tree_sitter.h +++ b/src/tree_sitter.h @@ -37,7 +37,7 @@ #define EMACS_TREE_SITTER_H TSParser *parser; TSTree *tree; TSInput input; - TSInputEdit edit; + bool need_reparse; }; /* A wrapper around a tree-sitter node. 
*/ @@ -78,11 +78,21 @@ XTS_NODE (Lisp_Object a) return XUNTAG (a, Lisp_Vectorlike, struct Lisp_TS_Node); } -void -ts_before_change (ptrdiff_t charpos, ptrdiff_t lendel); +INLINE void +CHECK_TS_PARSER (Lisp_Object parser) +{ + CHECK_TYPE (TS_PARSERP (parser), Qtree_sitter_parser_p, parser); +} + +INLINE void +CHECK_TS_NODE (Lisp_Object node) +{ + CHECK_TYPE (TS_NODEP (node), Qtree_sitter_node_p, node); +} void -ts_after_change (ptrdiff_t charpos, ptrdiff_t lendel, ptrdiff_t lenins); +ts_record_change (ptrdiff_t start_byte, ptrdiff_t old_end_byte, + ptrdiff_t new_end_byte); Lisp_Object make_ts_parser (struct buffer *buffer, TSParser *parser, TSTree *tree); diff --git a/test/src/tree-sitter-tests.el b/test/src/tree-sitter-tests.el new file mode 100644 index 0000000000..cb1c464d3a --- /dev/null +++ b/test/src/tree-sitter-tests.el @@ -0,0 +1,106 @@ +;;; tree-sitter-tests.el --- tests for src/tree-sitter.c -*- lexical-binding: t; -*- + +;; Copyright (C) 2021 Free Software Foundation, Inc. + +;; This file is part of GNU Emacs. + +;; GNU Emacs is free software: you can redistribute it and/or modify +;; it under the terms of the GNU General Public License as published by +;; the Free Software Foundation, either version 3 of the License, or +;; (at your option) any later version. + +;; GNU Emacs is distributed in the hope that it will be useful, +;; but WITHOUT ANY WARRANTY; without even the implied warranty of +;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +;; GNU General Public License for more details. + +;; You should have received a copy of the GNU General Public License +;; along with GNU Emacs. If not, see <https://www.gnu.org/licenses/>. + +;;; Code: + +(require 'ert) +(require 'tree-sitter-json) + +(ert-deftest tree-sitter-basic-parsing () + "Test basic parsing routines." 
+ (with-temp-buffer + (let ((parser (tree-sitter-create-parser + (current-buffer) (tree-sitter-json)))) + (should + (eq parser (car tree-sitter-parser-list))) + (should + (equal (tree-sitter-node-string + (tree-sitter-parser-root-node parser)) + "(ERROR)")) + + (insert "[1,2,3]") + (should + (equal (tree-sitter-node-string + (tree-sitter-parser-root-node parser)) + "(document (array (number) (number) (number)))")) + + (goto-char (point-min)) + (forward-char 3) + (insert "{\"name\": \"Bob\"},") + (should + (equal + (tree-sitter-node-string + (tree-sitter-parser-root-node parser)) + "(document (array (number) (object (pair key: (string (string_content)) value: (string (string_content)))) (number) (number)))"))))) + +(ert-deftest tree-sitter-node-api () + "Tests for node API." + (with-temp-buffer + (insert "[1,2,{\"name\": \"Bob\"},3]") + (let (parser root-node doc-node object-node pair-node) + (setq parser (tree-sitter-create-parser + (current-buffer) (tree-sitter-json))) + (setq root-node (tree-sitter-parser-root-node + parser)) + ;; `tree-sitter-node-type'. + (should (eq 'document (tree-sitter-node-type root-node))) + ;; `tree-sitter-node-check'. + (should (eq t (tree-sitter-node-check root-node 'named))) + (should (eq nil (tree-sitter-node-check root-node 'missing))) + (should (eq nil (tree-sitter-node-check root-node 'extra))) + (should (eq nil (tree-sitter-node-check root-node 'has-error))) + ;; `tree-sitter-node-child'. + (setq doc-node (tree-sitter-node-child root-node 0)) + (should (eq 'array (tree-sitter-node-type doc-node))) + (should (equal (tree-sitter-node-string doc-node) + "(array (number) (number) (object (pair key: (string (string_content)) value: (string (string_content)))) (number))")) + ;; `tree-sitter-node-child-count'. + (should (eql 9 (tree-sitter-node-child-count doc-node))) + (should (eql 4 (tree-sitter-node-child-count doc-node t))) + ;; `tree-sitter-node-field-name-for-child'. 
+ (setq object-node (tree-sitter-node-child doc-node 2 t)) + (setq pair-node (tree-sitter-node-child object-node 0 t)) + (should (eq 'object (tree-sitter-node-type object-node))) + (should (eq 'pair (tree-sitter-node-type pair-node))) + (should (equal "key" + (tree-sitter-node-field-name-for-child + pair-node 0))) + ;; `tree-sitter-node-child-by-field-name'. + (should (equal "(string (string_content))" + (tree-sitter-node-string + (tree-sitter-node-child-by-field-name + pair-node "key")))) + ;; `tree-sitter-node-next-sibling'. + (should (equal "(number)" + (tree-sitter-node-string + (tree-sitter-node-next-sibling object-node t)))) + (should (equal "(\",\")" + (tree-sitter-node-string + (tree-sitter-node-next-sibling object-node)))) + ;; `tree-sitter-node-prev-sibling'. + (should (equal "(number)" + (tree-sitter-node-string + (tree-sitter-node-prev-sibling object-node t)))) + (should (equal "(\",\")" + (tree-sitter-node-string + (tree-sitter-node-prev-sibling object-node)))) + ))) + +(provide 'tree-sitter-tests) +;;; tree-sitter-tests.el ends here -- 2.24.3 (Apple Git-128) ^ permalink raw reply related [flat|nested] 370+ messages in thread
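The ts_record_change calls in the patch above all pass the same three values: the start byte, the old end byte, and the new end byte of the edited region. A minimal sketch of that mapping follows. It assumes a hypothetical struct edit_bytes standing in for the three byte fields of tree-sitter's TSInputEdit (the real struct also carries three TSPoint fields, which the patch fills with a dummy point), and the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the byte fields of TSInputEdit.  */
struct edit_bytes
{
  uint32_t start_byte;   /* where the change begins */
  uint32_t old_end_byte; /* end of the replaced text, before the change */
  uint32_t new_end_byte; /* end of the inserted text, after the change */
};

/* Describe replacing the bytes in [START, OLD_END) with NEW_LEN bytes.  */
struct edit_bytes
edit_for_replace (uint32_t start, uint32_t old_end, uint32_t new_len)
{
  assert (start <= old_end);
  assert (new_len <= UINT32_MAX - start); /* no wrap-around */
  struct edit_bytes e = { start, old_end, start + new_len };
  return e;
}

/* An insertion replaces an empty range.  */
struct edit_bytes
edit_for_insert (uint32_t pos, uint32_t len)
{
  return edit_for_replace (pos, pos, len);
}

/* A deletion replaces a range with nothing.  */
struct edit_bytes
edit_for_delete (uint32_t start, uint32_t end)
{
  return edit_for_replace (start, end, 0);
}
```

Under this reading, an insertion has start_byte equal to old_end_byte, and a deletion has new_end_byte equal to start_byte, which matches the shape of the del_range_2 and replace_range_2 hunks.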
* Re: How to add pseudo vector types
From: Eli Zaretskii @ 2021-07-22 8:23 UTC
To: Yuan Fu; +Cc: cpitclaudel, monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Wed, 21 Jul 2021 23:10:14 -0400
> Cc: Clément Pit-Claudel <cpitclaudel@gmail.com>,
>  emacs-devel <emacs-devel@gnu.org>
>
> Should I keep posting patches, or should I create a branch in /scratch?

The latter, I think.

> If the latter, how do I do it?

You need write access to the Emacs repository.

> @@ -96,33 +103,30 @@ ts_ensure_parsed (Lisp_Object parser)
>  ts_read_buffer (void *buffer, uint32_t byte_index,
>                  TSPoint position, uint32_t *bytes_read)
>  {
> -  if (! BUFFER_LIVE_P ((struct buffer *) buffer))
> +  if (!BUFFER_LIVE_P ((struct buffer *) buffer))
>      error ("BUFFER is not live");

Is it really the right thing to signal an error here? This is not code that would run from a user command, so signaling an error is not necessarily a useful response to this situation. Why not simply return without doing anything?

> +  // TODO BUF_ZV_BYTE?

Do you want to discuss this? I'd prefer to have it the other way around: use BUF_ZV_BYTE by default. The callers could widen the buffer if they needed to access text outside of the narrowing.

>    else
>      {
> -      beg = (char *) BYTE_POS_ADDR (byte_pos);
> +      beg = (char *) BUF_BYTE_ADDRESS (buffer, byte_pos);
>        len = next_char_len(byte_pos);

The last line is incorrect, as it assumes the current buffer. You actually don't need that function; it's enough to use BYTES_BY_CHAR_HEAD on the address in 'beg'.

>    *bytes_read = (uint32_t) len;

Is using uint32_t the restriction of tree-sitter? Doesn't it support reading more than 2 gigabytes?

> +DEFUN ("tree-sitter-node-type",
> +       Ftree_sitter_node_type, Stree_sitter_node_type, 1, 1, 0,
> +       doc: /* Return the NODE's type as a symbol.  */)
> +  (Lisp_Object node)
> +{
> +  CHECK_TS_NODE (node);
> +  TSNode ts_node = XTS_NODE (node)->node;
> +  const char *type = ts_node_type(ts_node);
> +  return intern_c_string (type);

Why do we need to intern the string each time? Can't we store the interned symbol there, instead of a C string, in the first place?

Thanks.
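The suggestion to derive the character length from the byte at 'beg', rather than from a buffer position, can be illustrated in standalone C. This is only a sketch assuming plain UTF-8 text: utf8_len_from_lead and char_len_at are hypothetical names, not Emacs functions, and Emacs's own BYTES_BY_CHAR_HEAD may handle internal multibyte forms that this sketch does not.

```c
#include <stddef.h>

/* Length in bytes of the UTF-8 sequence introduced by LEAD, or 0 if
   LEAD is a continuation byte or not a valid lead byte.  */
size_t
utf8_len_from_lead (unsigned char lead)
{
  if (lead < 0x80)
    return 1;                 /* 0xxxxxxx: ASCII */
  if ((lead & 0xE0) == 0xC0)
    return 2;                 /* 110xxxxx */
  if ((lead & 0xF0) == 0xE0)
    return 3;                 /* 1110xxxx */
  if ((lead & 0xF8) == 0xF0)
    return 4;                 /* 11110xxx */
  return 0;
}

/* Number of bytes to hand out for the character starting at
   TEXT[POS], given SIZE bytes of text.  Returns 0 at or past the end,
   which is what a tree-sitter read callback reports as end of input.
   The answer depends only on the bytes passed in, not on any notion
   of a "current buffer".  */
size_t
char_len_at (const char *text, size_t size, size_t pos)
{
  if (pos >= size)
    return 0;
  size_t len = utf8_len_from_lead ((unsigned char) text[pos]);
  if (len == 0)
    len = 1;                  /* invalid lead: advance one byte */
  size_t remaining = size - pos;
  return len < remaining ? len : remaining;
}
```

The point of the design is that the length is computed from the same memory the callback returns, so it cannot disagree with the buffer being read.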
* Re: How to add pseudo vector types
From: Yuan Fu @ 2021-07-22 13:47 UTC
To: Eli Zaretskii; +Cc: Clément Pit-Claudel, Stefan Monnier, emacs-devel

>> +  // TODO BUF_ZV_BYTE?
>
> Do you want to discuss this? I'd prefer to have it the other way
> around: use BUF_ZV_BYTE by default. The callers could widen the
> buffer if they needed to access outside of the narrowing.

Yes, I meant to discuss this. The problem with respecting narrowing is that a user can narrow and widen arbitrarily, and every time they do, Emacs needs to translate the change into insertions and deletions of buffer text for tree-sitter. Plus, if tree-sitter respects narrowing, it could happen that a user narrows the buffer and the font-locking changes and is not correct anymore. Maybe that’s not what the user wants. Also, if someone narrows and widens often, maybe narrowing to a function for better focus, tree-sitter needs to constantly re-parse most of the buffer. These are not significant disadvantages, but what do we get from respecting narrowing that justifies the code complexity and these small annoyances?

>> *bytes_read = (uint32_t) len;
>
> Is using uint32_t the restriction of tree-sitter? Doesn't it support
> reading more than 2 gigabytes?

I’m not sure why it asks for uint32 specifically, but that’s what its API asks for. I don’t think you are supposed to use tree-sitter on files that are gigabytes in size, because the author mentioned that tree-sitter uses over 10x as much memory as the size of the source file [1]. On files larger than a couple of megabytes, I think we had better turn off tree-sitter. Normally those files are not regular source files anyway, and we don’t need a parse tree for a log.

That leads to another point. I suspect the memory limit will come before the speed limit, i.e., as the file size increases, the memory consumption will become unacceptable before the speed does. So it is possible that we want to outright disable tree-sitter for larger files; then we don’t need to do much to improve the responsiveness of tree-sitter on large files. And we might want to delete the parse tree if a buffer has been idle for a while. Of course, that’s just my superstition; we’ll see once we can measure the performance.

>> +DEFUN ("tree-sitter-node-type",
>> +       Ftree_sitter_node_type, Stree_sitter_node_type, 1, 1, 0,
>> +       doc: /* Return the NODE's type as a symbol.  */)
>> +  (Lisp_Object node)
>> +{
>> +  CHECK_TS_NODE (node);
>> +  TSNode ts_node = XTS_NODE (node)->node;
>> +  const char *type = ts_node_type(ts_node);
>> +  return intern_c_string (type);
>
> Why do we need to intern the string each time? can't we store the
> interned symbol there, instead of a C string, in the first place?

I’m not sure what you mean by “store the interned symbol there”. Where do I store the interned symbol? (BTW, if you see something wrong, that’s probably because I don’t know the right way to do it, and grepping only got me so far.)

[1]: https://github.com/tree-sitter/tree-sitter/issues/222#issuecomment-435987441

Thanks,
Yuan
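The size cutoff floated in the message above could look roughly like the following. This is a sketch under stated assumptions: the factor of 10 is the rough figure from the cited tree-sitter issue rather than a measured constant, and the function name and the idea of a fixed byte budget are hypothetical, not something the patch implements.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed parse-tree overhead relative to source size; taken from the
   "over 10x" figure quoted above, not from measurement.  */
#define ASSUMED_TREE_FACTOR 10

/* Return true if a parse tree for SOURCE_BYTES of text is expected to
   fit within BUDGET_BYTES.  The budget is divided, instead of the
   source size being multiplied, so the computation cannot overflow.  */
bool
tree_fits_budget (uint64_t source_bytes, uint64_t budget_bytes)
{
  return source_bytes <= budget_bytes / ASSUMED_TREE_FACTOR;
}
```

With a 64MB budget, for instance, a 2MB source file passes and a 20MB one does not. Whether any such threshold is appropriate is exactly what the rest of the thread debates.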
* Re: How to add pseudo vector types
From: Óscar Fuentes @ 2021-07-22 14:11 UTC
To: emacs-devel

Yuan Fu <casouri@gmail.com> writes:

> That leads to another point. I suspect the memory limit will come
> before the speed limit, i.e., as the file size increases, the memory
> consumption will become unacceptable before the speed does. So it is
> possible that we want to outright disable tree-sitter for larger
> files, then we don’t need to do much to improve the responsiveness of
> tree-sitter on large files. And we might want to delete the parse tree
> if a buffer has been idle for a while. Of course, that’s just my
> superstition, we’ll see once we can measure the performance.

Of course those parameters would be configurable in Emacs, but disabling TS on a 2MB file because it uses 20MB is way too conservative, IMHO. Nowadays the cheapest netbook comes with at least 1GB RAM and can do memory-to-memory copies at a rate of GB/s.

Guys, you are speculating too much about minutiae and worst-case scenarios. (Do we really care about TS not supporting files larger than 4GB? I mean, REALLY?)

I'd rather focus on implementing the thing and optimize later. My bet is that a crude implementation would work fine for 99% of users and be an improvement over what we have now in practically all cases.

BTW, a 10x AST/source-code size ratio is quite reasonable.
* Re: How to add pseudo vector types
From: Eli Zaretskii @ 2021-07-22 17:09 UTC
To: Óscar Fuentes; +Cc: emacs-devel

> From: Óscar Fuentes <ofv@wanadoo.es>
> Date: Thu, 22 Jul 2021 16:11:09 +0200
>
> Yuan Fu <casouri@gmail.com> writes:
>
> > That leads to another point. I suspect the memory limit will come
> > before the speed limit, i.e., as the file size increases, the memory
> > consumption will become unacceptable before the speed does. So it is
> > possible that we want to outright disable tree-sitter for larger
> > files, then we don’t need to do much to improve the responsiveness of
> > tree-sitter on large files. And we might want to delete the parse tree
> > if a buffer has been idle for a while. Of course, that’s just my
> > superstition, we’ll see once we can measure the performance.
>
> Of course those parameters would be configurable in Emacs, but disabling
> TS on a 2MB file because it uses 20MB is way too conservative, IMHO.

Why would we limit ourselves to 20MB? uint32_t supports up to 4GB.

> Guys, you are speculating too much about minutiae and worst-case
> scenarios. (Do we really care about TS not supporting files larger than
> 4GB? I mean, REALLY?)

Yes, we do. For at least 2 reasons: (a) source code files produced by programs can be very large; (b) having a feature that fails before you reach the max size of a buffer Emacs supports is a problem, because it will cause hard-to-deal-with problems.

Or let me turn the tables and ask why we cared to support the largest possible buffer size when 32-bit systems were the rule?

> I'd rather focus on implementing the thing and optimize later. My bet
> is that a crude implementation would work fine for 99% of users
> and be an improvement over what we have now in practically all cases.

This is not a prototype project. (Or at least I hope it won't end up being that.) This is supposed to be the industry-strength code that core Emacs will use for the years to come to support features which need language-dependent parsing. It cannot work correctly only in 99% of use cases. So we must assess the limitations seriously and plan ahead for them.

> BTW, a 10x AST/source-code size ratio is quite reasonable.

It could be, but please don't forget that this is _in_addition_to_ the "normal" Emacs memory footprint, and that could easily be 1GB and sometimes several times that.
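One way to make the 4GB limit explicit, instead of letting a cast truncate silently, is to range-check a position before narrowing it to uint32_t. The following is a minimal sketch of that idea; fits_in_uint32 is a hypothetical helper, not an Emacs or tree-sitter function, and what a caller should do on failure (disable the feature, fall back, and so on) is left open, as it is in the discussion.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Store POS in *OUT and return true if POS is representable as a
   uint32_t; otherwise leave *OUT untouched and return false.  */
bool
fits_in_uint32 (ptrdiff_t pos, uint32_t *out)
{
  if (pos < 0 || (uintmax_t) pos > UINT32_MAX)
    return false;
  *out = (uint32_t) pos;
  return true;
}
```

A caller that gets false back knows the position is out of range before any value reaches the parser.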
* Re: How to add pseudo vector types 2021-07-22 17:09 ` Eli Zaretskii @ 2021-07-22 19:29 ` Óscar Fuentes 2021-07-23 5:21 ` Eli Zaretskii 2021-07-24 9:38 ` Stephen Leake 0 siblings, 2 replies; 370+ messages in thread From: Óscar Fuentes @ 2021-07-22 19:29 UTC (permalink / raw) To: emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> Of course those parameters would be configurable on Emacs, but disabling >> TS on a 2MB file because it uses 20MB is way too conservative, IMHO. > > Why would we limit ourselves to 20MB? uint32_t supports upto 4GB. I didn't suggest that we should limit ourselves to 20MB, I observed that current machines have enough resources for handling large files ("large" meaning "big enough to keep me busy reading for some years.") >> Guys, you are speculating too much about minutia and worst-case >> scenarios. (Do we really care about TS not supporting files larger than >> 4GB? I mean, REALLY?) > > Yes, we do. For at least 2 reasons: (a) source code files produced by > programs can be very large; I know, I work with machine-generated (read: code-dense) 20+MB C++ files on a regular basis. However, I wouldn't agree on renouncing to useful features because they could be problematic when dealing with large files. That is, it would be a mistake to discard TS as inadequate for Emacs just because it doesn't benefit (and I say "not benefit", not "penalise") certain use cases. > (b) having a feature that fails before you > reach the max size of a buffer Emacs supports is a problem, because it > will cause hard-to-deal-with problems. We can put reasonable limits on when to use TS once we have some experience with it. What matters right now is if TS would be usable for the typical use case, and I guess the answer is positive. Also, it is not as if we had other options to consider. >> I'll rather focus on implementing the thing and optimize later. 
My bet >> is that a crude implementation would work fine for the 99% of the users >> and be an improvement over what we have now on practically all cases. > > This is not a prototype project. (Or at least I hope it won't end up > being that.) This is supposed to be the industry-strength code that > core Emacs will use for the years to come to support features which > need language-dependent parsing. It cannot work correctly only in 99% > of use cases. So we must assess the limitations seriously and plan > ahead for them. I said "would work *fine* for the 99% of users"; this does not imply that it would work incorrectly for the rest. On the "planning ahead" part, TS support would be an optional, quasi-external feature for some time, so even if it comes out with some critical bug, Emacs would not become unusable. TS support can be fine-tuned without disrupting the rest of Emacs development. If, on the other hand, we start making changes to Emacs' internals to allow some TS-related optimizations (even when we don't know if they are necessary at all), that could be a destabilizing move for Emacs as a whole, apart from delaying TS support. >> BTW, a 10x AST/source-code size ratio is quite reasonable. > > It could be, but please don't forget that this is _in_addition_to_ the > "normal" Emacs memory footprint, and that could easily be 1GB and > sometimes several times that. Yes, but if you want something you need to pay something, and you can hardly get TS' features with less than that, at least for complex languages like C++. Talking about scenarios of heavy memory usage, I'll comment in passing that in my recent experience, once Emacs exceeds 2GB the gc pauses start to be so annoying that I don't care anymore about how much memory an external tool would use if it works fast enough. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-22 19:29 ` Óscar Fuentes @ 2021-07-23 5:21 ` Eli Zaretskii 2021-07-24 9:38 ` Stephen Leake 1 sibling, 0 replies; 370+ messages in thread From: Eli Zaretskii @ 2021-07-23 5:21 UTC (permalink / raw) To: Óscar Fuentes; +Cc: emacs-devel > From: Óscar Fuentes <ofv@wanadoo.es> > Date: Thu, 22 Jul 2021 21:29:02 +0200 > > >> Guys, you are speculating too much about minutia and worst-case > >> scenarios. (Do we really care about TS not supporting files larger than > >> 4GB? I mean, REALLY?) > > > > Yes, we do. For at least 2 reasons: (a) source code files produced by > > programs can be very large; > > I know, I work with machine-generated (read: code-dense) 20+MB C++ files > on a regular basis. > > However, I wouldn't agree on renouncing to useful features because they > could be problematic when dealing with large files. That is, it would be > a mistake to discard TS as inadequate for Emacs just because it doesn't > benefit (and I say "not benefit", not "penalise") certain use cases. It was not my intent to say we should discard TS as inadequate because of these limitations. What I meant is that we should know about the limitations and plan in advance how to handle them when a user bumps into them. Disabling TS-related features could be one such mitigation, but maybe we could come up with smarter fallbacks. It sounds like the rest of your message was to convince me not to give up on TS, in which case there's no need: I'm convinced already, and mostly agree with what you say. > Talking about scenarios of heavy memory usage, I'll comment in passing > that in my recent experience, once Emacs exceeds 2GB the gc pauses start > to be so annoying that I don't care anymore about how much memory an > external tool would use if it works fast enough. That's a separate issue. And the amount of memory GC has to scan is not directly related to the memory footprint of the Emacs process. 
So I would be interested in seeing the results of memory-report in those cases where GC takes too long (in a separate thread, please). ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-22 19:29 ` Óscar Fuentes 2021-07-23 5:21 ` Eli Zaretskii @ 2021-07-24 9:38 ` Stephen Leake 1 sibling, 0 replies; 370+ messages in thread From: Stephen Leake @ 2021-07-24 9:38 UTC (permalink / raw) To: Óscar Fuentes; +Cc: emacs-devel Óscar Fuentes <ofv@wanadoo.es> writes: > Eli Zaretskii <eliz@gnu.org> writes: >> (b) having a feature that fails before you >> reach the max size of a buffer Emacs supports is a problem, because it >> will cause hard-to-deal-with problems. > > We can put reasonable limits on when to use TS once we have some > experience with it. What matters right now is if TS would be usable for > the typical use case, and I guess the answer is positive. Also, it is > not as if we had other options to consider. wisi supports > 4G, not that I've actually tried it. And incremental parse is now working well enough to benchmark, in my devel version. -- -- Stephe ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-22 13:47 ` Yuan Fu 2021-07-22 14:11 ` Óscar Fuentes @ 2021-07-22 17:00 ` Eli Zaretskii 2021-07-22 17:47 ` Yuan Fu ` (2 more replies) 2021-07-24 9:33 ` Stephen Leake 2 siblings, 3 replies; 370+ messages in thread From: Eli Zaretskii @ 2021-07-22 17:00 UTC (permalink / raw) To: Yuan Fu; +Cc: cpitclaudel, monnier, emacs-devel > From: Yuan Fu <casouri@gmail.com> > Date: Thu, 22 Jul 2021 09:47:45 -0400 > Cc: Stefan Monnier <monnier@iro.umontreal.ca>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > emacs-devel@gnu.org > > Yes, I meant to discuss this. The problem with respecting narrowing is that, a user can freely narrow and widen arbitrarily, and Emacs needs to translate them into insertion & deletion of the buffer text for tree-sitter, every time a user narrows or widens the buffer. Plus, if tree-sitter respects narrowing, it could happen where a user narrows the buffer, the font-locking changes and is not correct anymore. Maybe that’s not the user want. Also, if someone narrows and widens often, maybe narrow to a function for better focus, tree-sitter needs to constantly re-parse most of the buffer. These are not significant disadvantages, but what do we get from respecting narrowing that justifies code complexity and these small annoyances? But that's how the current font-lock and indentation work: they never look beyond the narrowing limits. So why should the TS-based features behave differently? As for temporary narrowing: if we record the changes, but don't send them to TS until we actually need re-parsing, then we could eliminate the temporary narrowing when we report the changes to TS, leaving only the narrowing that exists at the time of the re-parse. At least for fontifications, that time is redisplay time, and users do expect to see the text fontified according to the current narrowing. > >> *bytes_read = (uint32_t) len; > > > > Is using uint32_t the restriction of tree-sitter? 
Doesn't it support > > reading more than 2 gigabytes? > > I’m not sure why it asks for uint32 specifically, but that’s what it asks for its api. I don’t think you are supposed to use tree-sitter on files of size of gigabytes, because the author mentioned that tree-sitter uses over 10x as much memory as the size of the source file [1]. On files larger than a couple of megabytes, I think we better turn off tree-sitter. Normally those files are not regular source files, anyway, and we don’t need a parse tree for a log. I don't necessarily agree with the "not regular source files" part. For example, JSON files can be quite large. And there are also log files, which are even larger -- did no one adapt TS to fontifying those yet? More generally: is the problem real? If you make a file that is 1000 copies of xdisp.c, and then submit it to TS, do you really get 10GB of memory consumption? This is something that is good to know up front, so we'd know what to expect down the road. > That leads to another point. I suspect the memory limit will come before the speed limit, i.e., as the file size increases, the memory consumption will become unacceptable before the speed does. So it is possible that we want to outright disable tree-sitter for larger files, then we don’t need to do much to improve the responsiveness of tree-sitter on large files. And we might want to delete the parse tree if a buffer has been idle for a while. Of course, that’s just my superstition, we’ll see once we can measure the performance. See above: IMO, we should benchmark both the CPU and memory performance of TS for such large files, before we decide on the course of action. > >> +DEFUN ("tree-sitter-node-type", > >> + Ftree_sitter_node_type, Stree_sitter_node_type, 1, 1, 0, > >> + doc: /* Return the NODE's type as a symbol. 
*/) > >> + (Lisp_Object node) > >> +{ > >> + CHECK_TS_NODE (node); > >> + TSNode ts_node = XTS_NODE (node)->node; > >> + const char *type = ts_node_type(ts_node); > >> + return intern_c_string (type); > > > > Why do we need to intern the string each time? can't we store the > > interned symbol there, instead of a C string, in the first place? > > I’m not sure what do you mean by “store the interned symbol there”, where do I store the interned symbol? In the struct that ts_node_type accesses, instead of the 'char *' string you store there now. > (BTW, If you see something wrong, that’s probably because I don’t know the right way to do it, and grepping only got me that far.) Do what? feel free to ask questions when you aren't sure how to accomplish something on the C level. ^ permalink raw reply [flat|nested] 370+ messages in thread
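The `*bytes_read = (uint32_t) len;` cast questioned in this message truncates silently if `len` ever exceeds what a `uint32_t` holds. A minimal sketch of guarding such a cast follows. It assumes the length is available as a signed 64-bit value; `clamp_to_uint32` is a hypothetical helper, not part of tree-sitter or of the patch under review. It does not lift the 4 GB addressing limit of a 32-bit byte index; it only turns wrap-around into a visible cap.

```c
#include <stdint.h>

/* Hypothetical helper: clamp a signed length to the range a uint32_t
   can represent, so that a later cast cannot silently wrap around.  */
uint32_t
clamp_to_uint32 (int64_t len)
{
  if (len <= 0)
    return 0;                       /* nothing (or nonsense) to report */
  if (len > (int64_t) UINT32_MAX)
    return UINT32_MAX;              /* cap instead of wrapping */
  return (uint32_t) len;
}
```

A read callback could then write `*bytes_read = clamp_to_uint32 (len);` and hand out the remaining text on the next call.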
* Re: How to add pseudo vector types 2021-07-22 17:00 ` Eli Zaretskii @ 2021-07-22 17:47 ` Yuan Fu 2021-07-22 19:05 ` Eli Zaretskii 2021-07-23 14:07 ` Stefan Monnier 2021-07-24 9:42 ` Stephen Leake 2 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-07-22 17:47 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Clément Pit-Claudel, Stefan Monnier, emacs-devel > On Jul 22, 2021, at 1:00 PM, Eli Zaretskii <eliz@gnu.org> wrote: > >> From: Yuan Fu <casouri@gmail.com> >> Date: Thu, 22 Jul 2021 09:47:45 -0400 >> Cc: Stefan Monnier <monnier@iro.umontreal.ca>, >> Clément Pit-Claudel <cpitclaudel@gmail.com>, >> emacs-devel@gnu.org >> >> Yes, I meant to discuss this. The problem with respecting narrowing is that, a user can freely narrow and widen arbitrarily, and Emacs needs to translate them into insertion & deletion of the buffer text for tree-sitter, every time a user narrows or widens the buffer. Plus, if tree-sitter respects narrowing, it could happen where a user narrows the buffer, the font-locking changes and is not correct anymore. Maybe that’s not the user want. Also, if someone narrows and widens often, maybe narrow to a function for better focus, tree-sitter needs to constantly re-parse most of the buffer. These are not significant disadvantages, but what do we get from respecting narrowing that justifies code complexity and these small annoyances? > > But that's how the current font-lock and indentation work: they never > look beyond the narrowing limits. So why should the TS-based features > behave differently? > > As for temporary narrowing: if we record the changes, but don't send > them to TS until we actually need re-parsing, then we could eliminate > the temporary narrowing when we report the changes to TS, leaving only > the narrowing that exists at the time of the re-parse. At least for > fontifications, that time is redisplay time, and users do expect to > see the text fontified according to the current narrowing. 
> >>>> *bytes_read = (uint32_t) len; >>> >>> Is using uint32_t the restriction of tree-sitter? Doesn't it support >>> reading more than 2 gigabytes? >> >> I’m not sure why it asks for uint32 specifically, but that’s what it asks for its api. I don’t think you are supposed to use tree-sitter on files of size of gigabytes, because the author mentioned that tree-sitter uses over 10x as much memory as the size of the source file [1]. On files larger than a couple of megabytes, I think we better turn off tree-sitter. Normally those files are not regular source files, anyway, and we don’t need a parse tree for a log. > > I don't necessarily agree with the "not regular source files" part. > For example, JSON files can be quite large. And there are also log > files, which are even larger -- did no one adapt TS to fontifying > those yet? There is a JSON parser, but I don’t think there is one for log files. > > More generally: is the problem real? If you make a file that is 1000 > copies of xdisp.c, and then submit it to TS, do you really get 10GB of > memory consumption? This is something that is good to know up front, > so we'd know what to expect down the road. Yes. I concatenated 100 xdisp.c together, and parsed them with my simple C program. It used 1.8 G. I didn’t test for 1000 together, but I think the trend is linear.

time -l ./main-large-c
       16.48 real        15.32 user         0.81 sys
  1883959296  maximum resident set size
           0  average shared memory size
           0  average unshared data size
           0  average unshared stack size
      459951  page reclaims
          22  page faults
           0  swaps
           0  block input operations
           0  block output operations
           0  messages sent
           0  messages received
           0  signals received
           6  voluntary context switches
        1653  involuntary context switches
107310143182  instructions retired
 58561420060  cycles elapsed
  1883095040  peak memory footprint

> >> That leads to another point. 
I suspect the memory limit will come before the speed limit, i.e., as the file size increases, the memory consumption will become unacceptable before the speed does. So it is possible that we want to outright disable tree-sitter for larger files, then we don’t need to do much to improve the responsiveness of tree-sitter on large files. And we might want to delete the parse tree if a buffer has been idle for a while. Of course, that’s just my superstition, we’ll see once we can measure the performance. > > See above: IMO, we should benchmark both the CPU and memory > performance of TS for such large files, before we decide on the course > of action. That’s my thought, too. I should have reserved my suspicion until I have benchmark measurements. > >>>> +DEFUN ("tree-sitter-node-type", >>>> + Ftree_sitter_node_type, Stree_sitter_node_type, 1, 1, 0, >>>> + doc: /* Return the NODE's type as a symbol. */) >>>> + (Lisp_Object node) >>>> +{ >>>> + CHECK_TS_NODE (node); >>>> + TSNode ts_node = XTS_NODE (node)->node; >>>> + const char *type = ts_node_type(ts_node); >>>> + return intern_c_string (type); >>> >>> Why do we need to intern the string each time? can't we store the >>> interned symbol there, instead of a C string, in the first place? >> >> I’m not sure what do you mean by “store the interned symbol there”, where do I store the interned symbol? > > In the struct that ts_node_type accesses, instead of the 'char *' > string you store there now. The struct that ts_node_type accesses is a TSNode, which is defined by tree-sitter. ts_node_type is an API provided by tree-sitter, I’m just exposing it to lisp. I could return strings instead of symbols, but I thought symbols might be more appropriate and more convenient for users of this function. >> (BTW, If you see something wrong, that’s probably because I don’t know the right way to do it, and grepping only got me that far.) > > Do what? 
feel free to ask questions when you aren't sure how to > accomplish something on the C level. Thanks. Is the code below the correct way to set a buffer-local variable? (I’m setting tree-sitter-parser-list.)

struct buffer *old_buffer = current_buffer;
set_buffer_internal (XBUFFER (buffer));

Fset (Qtree_sitter_parser_list,
      Fcons (lisp_parser, Fsymbol_value (Qtree_sitter_parser_list)));

set_buffer_internal (old_buffer);

Also, we don’t call change hooks in replace_range_2, why? Should I update tree-sitter trees in that function, or should I not? Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
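The "trend is linear" estimate in this message can be worked through as a small calculation. The sketch below takes the reported maximum resident set size (1883959296, read here as bytes, which matches the 1.8 G quoted) for 100 copies of xdisp.c and scales it to 1000 copies. Linearity is an assumption that the thread has not measured, and `extrapolate_bytes` is a hypothetical helper name.

```c
#include <stdint.h>

/* Hypothetical helper: scale a measured memory figure linearly from
   MEASURED_COPIES copies of a file to TARGET_COPIES copies.
   Multiplying before dividing limits rounding loss; the values used
   here stay far below the uint64_t range.  */
uint64_t
extrapolate_bytes (uint64_t measured_bytes, uint64_t measured_copies,
                   uint64_t target_copies)
{
  return measured_bytes * target_copies / measured_copies;
}
```

With the numbers above, `extrapolate_bytes (1883959296, 100, 1000)` gives 18839592960 bytes, roughly 18.8 GB, if the growth really is linear.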
* Re: How to add pseudo vector types 2021-07-22 17:47 ` Yuan Fu @ 2021-07-22 19:05 ` Eli Zaretskii 2021-07-23 13:25 ` Yuan Fu 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-07-22 19:05 UTC (permalink / raw) To: Yuan Fu; +Cc: cpitclaudel, monnier, emacs-devel > From: Yuan Fu <casouri@gmail.com> > Date: Thu, 22 Jul 2021 13:47:20 -0400 > Cc: Stefan Monnier <monnier@iro.umontreal.ca>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > emacs-devel@gnu.org > > > More generally: is the problem real? If you make a file that is 1000 > > copies of xdisp.c, and then submit it to TS, do you really get 10GB of > > memory consumption? This is something that is good to know up front, > > so we'd know what to expect down the road. > > Yes. I concatenated 100 xdisp.c together, and parsed them with my simple C program. It used 1.8 G. I didn’t test for 1000 together, but I think the trend is linear. That's good to know, thanks. So what does TS do if it attempts to allocate more memory and that fails? Regardless, we'd need some fallback strategy, because AFAIU many people run with VM overcommit enabled, so the OOM killer will just kill the Emacs process when it asks for too much memory. > >>>> +DEFUN ("tree-sitter-node-type", > >>>> + Ftree_sitter_node_type, Stree_sitter_node_type, 1, 1, 0, > >>>> + doc: /* Return the NODE's type as a symbol. */) > >>>> + (Lisp_Object node) > >>>> +{ > >>>> + CHECK_TS_NODE (node); > >>>> + TSNode ts_node = XTS_NODE (node)->node; > >>>> + const char *type = ts_node_type(ts_node); > >>>> + return intern_c_string (type); > >>> > >>> Why do we need to intern the string each time? can't we store the > >>> interned symbol there, instead of a C string, in the first place? > >> > >> I’m not sure what do you mean by “store the interned symbol there”, where do I store the interned symbol? > > > > In the struct that ts_node_type accesses, instead of the 'char *' > > string you store there now. 
> > The struct that ts_node_type accesses is a TSNode, which is defined by tree-sitter. ts_node_type is an API provided by tree-sitter, I’m just exposing it to lisp. I could return strings instead of symbols, but I thought symbols might be more appropriate and more convenient for users of this function. Maybe there's a better way of exposing that to Lisp. But that's a minor point, it can be left for later. > Is below the correct way to set a buffer-local variable? (I’m setting tree-sitter-parser-list.) > > struct buffer *old_buffer = current_buffer; > set_buffer_internal (XBUFFER (buffer)); > > Fset (Qtree_sitter_parser_list, > Fcons (lisp_parser, Fsymbol_value (Qtree_sitter_parser_list))); > > set_buffer_internal (old_buffer); Yes, but it would be better to use DEFVAR_LISP and then you could assign directly to Vtree_sitter_parser_list, instead of using Fset. > Also, we don’t call change hooks in replace_range_2, why? Because it is called in a loop, one character at a time. The caller of replace_range_2 calls these hooks for the entire region, once. > Should I update tree-sitter trees in that function, or should I not? The only caller is casify_region, so you could update there. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-22 19:05 ` Eli Zaretskii @ 2021-07-23 13:25 ` Yuan Fu 2021-07-23 19:10 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-07-23 13:25 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Clément Pit-Claudel, Stefan Monnier, emacs-devel > On Jul 22, 2021, at 3:05 PM, Eli Zaretskii <eliz@gnu.org> wrote: > >> From: Yuan Fu <casouri@gmail.com> >> Date: Thu, 22 Jul 2021 13:47:20 -0400 >> Cc: Stefan Monnier <monnier@iro.umontreal.ca>, >> Clément Pit-Claudel <cpitclaudel@gmail.com>, >> emacs-devel@gnu.org >> >>> More generally: is the problem real? If you make a file that is 1000 >>> copies of xdisp.c, and then submit it to TS, do you really get 10GB of >>> memory consumption? This is something that is good to know up front, >>> so we'd know what to expect down the road. >> >> Yes. I concatenated 100 xdisp.c together, and parsed them with my simple C program. It used 1.8 G. I didn’t test for 1000 together, but I think the trend is linear. > > That's good to know, thanks. > > So what does TS do if it attempts to allocate more memory and that > fails? Regardless, we'd need some fallback strategy, because AFAIU > many people run with VM overcommit enabled, so the OOM killer will > just kill the Emacs process when it asks for too much memory. Abort, it seems:

static inline void *ts_malloc_default(size_t size) {
  void *result = malloc(size);
  if (size > 0 && !result) {
    fprintf(stderr, "tree-sitter failed to allocate %zu bytes", size);
    exit(1);
  }
  return result;
}

>> Also, we don’t call change hooks in replace_range_2, why? > > Because it is called in a loop, one character at a time. The caller > of replace_range_2 calls these hooks for the entire region, once. > >> Should I update tree-sitter trees in that function, or should I not? > > The only caller is casify_region, so you could update there. casify_region doesn’t have access to byte positions. 
I’ll leave it as-is, recording change in replace_range_2, if you don’t object to it. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-23 13:25 ` Yuan Fu @ 2021-07-23 19:10 ` Eli Zaretskii 2021-07-23 20:01 ` Perry E. Metzger 2021-07-23 20:22 ` Yuan Fu 0 siblings, 2 replies; 370+ messages in thread From: Eli Zaretskii @ 2021-07-23 19:10 UTC (permalink / raw) To: Yuan Fu; +Cc: cpitclaudel, monnier, emacs-devel > From: Yuan Fu <casouri@gmail.com> > Date: Fri, 23 Jul 2021 09:25:17 -0400 > Cc: Stefan Monnier <monnier@iro.umontreal.ca>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > emacs-devel@gnu.org > > > So what does TS do if it attempts to allocate more memory and that > > fails? Regardless, we'd need some fallback strategy, because AFAIU > > many people run with VM overcommit enabled, so the OOM killer will > > just kill the Emacs process when it asks for too much memory. > > Abort, it seems: > > static inline void *ts_malloc_default(size_t size) { > void *result = malloc(size); > if (size > 0 && !result) { > fprintf(stderr, "tree-sitter failed to allocate %zu bytes", size); > exit(1); > } > return result; > } We must replace this function, if only because the MS-Windows build of Emacs uses a custom malloc implementation. Does TS allow the client to use its own malloc? > >> Also, we don’t call change hooks in replace_range_2, why? > > > > Because it is called in a loop, one character at a time. The caller > > of replace_range_2 calls these hooks for the entire region, once. > > > >> Should I update tree-sitter trees in that function, or should I not? > > > > The only caller is casify_region, so you could update there. > > casify_region doesn’t have access to byte positions. You can compute them using CHAR_TO_BYTE. > I’ll leave it as-is, recording change in replace_range_2, if you don’t object to it. That'd be wasteful, I think. replace_range_2 is called one character at a time. ^ permalink raw reply [flat|nested] 370+ messages in thread
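The distinction between character positions and byte positions behind the CHAR_TO_BYTE suggestion can be illustrated with plain UTF-8. The sketch below is not how Emacs implements CHAR_TO_BYTE: it uses 0-based indices, assumes valid UTF-8 input, and scans linearly from the start of the text. `utf8_char_to_byte` is a hypothetical name.

```c
#include <stddef.h>

/* Hypothetical illustration: return the byte offset of the character
   with 0-based index CHAR_INDEX in the valid UTF-8 text TEXT of
   BYTE_LEN bytes.  If the text has fewer characters, return BYTE_LEN.  */
size_t
utf8_char_to_byte (const unsigned char *text, size_t byte_len,
                   size_t char_index)
{
  size_t byte = 0;
  size_t chars_seen = 0;

  while (byte < byte_len && chars_seen < char_index)
    {
      /* Step over the lead byte of the current character...  */
      byte++;
      /* ...and over its continuation bytes, which look like 10xxxxxx.  */
      while (byte < byte_len && (text[byte] & 0xC0) == 0x80)
        byte++;
      chars_seen++;
    }
  return byte;
}
```

For the four-byte text "aéb" (where "é" occupies two bytes), character index 2 maps to byte offset 3, which is why a parser that counts bytes cannot be fed character positions directly.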
* Re: How to add pseudo vector types 2021-07-23 19:10 ` Eli Zaretskii @ 2021-07-23 20:01 ` Perry E. Metzger 2021-07-24 5:52 ` Eli Zaretskii 2021-07-23 20:22 ` Yuan Fu 1 sibling, 1 reply; 370+ messages in thread From: Perry E. Metzger @ 2021-07-23 20:01 UTC (permalink / raw) To: emacs-devel On 7/23/21 15:10, Eli Zaretskii wrote: >> Abort, it seems: >> static inline void *ts_malloc_default(size_t size) { >> void *result = malloc(size); >> if (size > 0 && !result) { >> fprintf(stderr, "tree-sitter failed to allocate %zu bytes", size); >> exit(1); >> } >> return result; >> } > We must replace this function, if only because the MS-Windows build of > Emacs uses a custom malloc implementation. Does TS allow the client > to use its own malloc? Certainly more graceful allocation error behavior would be necessary in an Emacs context even on Unix-like operating systems. An unexpected hard exit could result in loss of data for the user. Perry ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-23 20:01 ` Perry E. Metzger @ 2021-07-24 5:52 ` Eli Zaretskii 0 siblings, 0 replies; 370+ messages in thread From: Eli Zaretskii @ 2021-07-24 5:52 UTC (permalink / raw) To: Perry E. Metzger; +Cc: emacs-devel > Date: Fri, 23 Jul 2021 16:01:14 -0400 > From: "Perry E. Metzger" <perry@piermont.com> > > On 7/23/21 15:10, Eli Zaretskii wrote: > > >> Abort, it seems: > >> static inline void *ts_malloc_default(size_t size) { > >> void *result = malloc(size); > >> if (size > 0 && !result) { > >> fprintf(stderr, "tree-sitter failed to allocate %zu bytes", size); > >> exit(1); > >> } > >> return result; > >> } > > We must replace this function, if only because the MS-Windows build of > > Emacs uses a custom malloc implementation. Does TS allow the client > > to use its own malloc? > > Certainly more graceful allocation error behavior would be necessary in > an Emacs context even on Unix-like operating systems. An unexpected hard > exit could result in loss of data for the user. Sure, which is why this must be replaced. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-23 19:10 ` Eli Zaretskii 2021-07-23 20:01 ` Perry E. Metzger @ 2021-07-23 20:22 ` Yuan Fu 2021-07-24 6:00 ` Eli Zaretskii 2021-07-24 15:04 ` Yuan Fu 1 sibling, 2 replies; 370+ messages in thread From: Yuan Fu @ 2021-07-23 20:22 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Clément Pit-Claudel, Stefan Monnier, emacs-devel > > We must replace this function, if only because the MS-Windows build of > Emacs uses a custom malloc implementation. Does TS allow the client > to use its own malloc? Yes, in that case, we need to embed tree-sitter into Emacs, instead of using it as a dynamic library, I think.

// Allow clients to override allocation functions
#ifndef ts_malloc
#define ts_malloc ts_malloc_default
#endif
#ifndef ts_calloc
#define ts_calloc ts_calloc_default
#endif
#ifndef ts_realloc
#define ts_realloc ts_realloc_default
#endif
#ifndef ts_free
#define ts_free ts_free_default
#endif

How do we handle such a thing in Emacs? > >>>> Also, we don’t call change hooks in replace_range_2, why? >>> >>> Because it is called in a loop, one character at a time. The caller >>> of replace_range_2 calls these hooks for the entire region, once. >>> >>>> Should I update tree-sitter trees in that function, or should I not? >>> >>> The only caller is casify_region, so you could update there. >> >> casify_region doesn’t have access to byte positions. > > You can compute them using CHAR_TO_BYTE. Ok. I’ll do that. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-23 20:22 ` Yuan Fu @ 2021-07-24 6:00 ` Eli Zaretskii 2021-07-25 18:01 ` Stephen Leake 2021-07-24 15:04 ` Yuan Fu 1 sibling, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-07-24 6:00 UTC (permalink / raw) To: Yuan Fu; +Cc: cpitclaudel, monnier, emacs-devel > From: Yuan Fu <casouri@gmail.com> > Date: Fri, 23 Jul 2021 16:22:59 -0400 > Cc: Stefan Monnier <monnier@iro.umontreal.ca>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > emacs-devel@gnu.org > > > We must replace this function, if only because the MS-Windows build of > > Emacs uses a custom malloc implementation. Does TS allow the client > > to use its own malloc? > > Yes, in that case, we need to embed tree-sitter into Emacs, instead of using it as a dynamic library, I think. > > // Allow clients to override allocation functions > #ifndef ts_malloc > #define ts_malloc ts_malloc_default > #endif > #ifndef ts_calloc > #define ts_calloc ts_calloc_default > #endif > #ifndef ts_realloc > #define ts_realloc ts_realloc_default > #endif > #ifndef ts_free > #define ts_free ts_free_default > #endif > > How do we handle such thing in Emacs? We use xmalloc, which calls memory_full when allocation fails, which releases some spare memory we have for this purpose, and tells the user to save the session and exit. ^ permalink raw reply [flat|nested] 370+ messages in thread
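The `#ifndef ts_malloc` override mechanism quoted above can be sketched in a single translation unit. Everything prefixed `sketch_` is hypothetical: `sketch_memory_full` only records the failure, whereas the real memory_full described in this message releases spare memory and warns the user, and `sketch_raw_alloc` exists only so that exhaustion can be simulated. The library side is the code quoted in the thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for memory_full: just note the failure.  */
bool sketch_memory_full_called = false;

void
sketch_memory_full (size_t nbytes)
{
  (void) nbytes;
  sketch_memory_full_called = true;
}

/* Raw allocator behind the wrapper; swappable to simulate exhaustion.  */
void *(*sketch_raw_alloc) (size_t) = malloc;

/* Allocator that always fails, for exercising the failure path.  */
void *
sketch_failing_alloc (size_t size)
{
  (void) size;
  return NULL;
}

/* Hypothetical xmalloc-like wrapper: report failure to the client
   instead of exiting the process.  */
void *
sketch_xmalloc (size_t size)
{
  void *result = sketch_raw_alloc (size);
  if (size > 0 && !result)
    sketch_memory_full (size);
  return result;
}

/* Client side: must be defined before the library code is compiled.  */
#define ts_malloc sketch_xmalloc

/* Library side, as quoted in the thread.  */
static inline void *ts_malloc_default(size_t size) {
  void *result = malloc(size);
  if (size > 0 && !result) {
    fprintf(stderr, "tree-sitter failed to allocate %zu bytes", size);
    exit(1);
  }
  return result;
}

#ifndef ts_malloc
#define ts_malloc ts_malloc_default   /* skipped: client already defined it */
#endif
```

Because the choice is made by the preprocessor when the library's own source is compiled, an already-built shared library keeps its default, which is presumably why embedding the source was suggested earlier in the thread.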
* Re: How to add pseudo vector types 2021-07-24 6:00 ` Eli Zaretskii @ 2021-07-25 18:01 ` Stephen Leake 2021-07-25 19:09 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Stephen Leake @ 2021-07-25 18:01 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Yuan Fu, emacs-devel, cpitclaudel, monnier Eli Zaretskii <eliz@gnu.org> writes: >> From: Yuan Fu <casouri@gmail.com> >> Date: Fri, 23 Jul 2021 16:22:59 -0400 >> Cc: Stefan Monnier <monnier@iro.umontreal.ca>, >> Clément Pit-Claudel <cpitclaudel@gmail.com>, >> emacs-devel@gnu.org >> >> > We must replace this function, if only because the MS-Windows build of >> > Emacs uses a custom malloc implementation. Does TS allow the client >> > to use its own malloc? >> >> Yes, in that case, we need to embed tree-sitter into Emacs, instead >> of using it as a dynamic library, I think. >> >> // Allow clients to override allocation functions >> #ifndef ts_malloc >> #define ts_malloc ts_malloc_default >> #endif >> #ifndef ts_calloc >> #define ts_calloc ts_calloc_default >> #endif >> #ifndef ts_realloc >> #define ts_realloc ts_realloc_default >> #endif >> #ifndef ts_free >> #define ts_free ts_free_default >> #endif >> >> How do we handle such thing in Emacs? > > We use xmalloc, which calls memory_full when allocation fails, which > releases some spare memory we have for this purpose, and tells the > user to save the session and exit. I'm thinking about how this applies to wisi, when migrating to a module. Ada has a built-in allocator; it's probably possible to change that, but I'd like to understand exactly why we need to do that. The Ada allocator throws an exception on allocation fail; is it sufficient to turn that exception into an elisp signal, and arrange for elisp to call memory_full (or take some other action, like killing the parser)? 
Another possible reason to change the Ada allocator is if we want to expose Ada memory pointers directly to elisp, as Yuan Fu wants to do for tree-sitter (I don't plan to do this for wisi). Does that require that the pointers be allocated by the same allocator? I'm not clear what that would mean for the garbage collector; is it then expected to recover the tree-sitter-allocated memory for the tree? or does it ignore those lisp objects? -- -- Stephe ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-25 18:01 ` Stephen Leake @ 2021-07-25 19:09 ` Eli Zaretskii 2021-07-26 5:10 ` Stephen Leake 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-07-25 19:09 UTC (permalink / raw) To: Stephen Leake; +Cc: casouri, emacs-devel, cpitclaudel, monnier > From: Stephen Leake <stephen_leake@stephe-leake.org> > Cc: Yuan Fu <casouri@gmail.com>, cpitclaudel@gmail.com, > monnier@iro.umontreal.ca, emacs-devel@gnu.org > Date: Sun, 25 Jul 2021 11:01:22 -0700 > > >> How do we handle such thing in Emacs? > > > > We use xmalloc, which calls memory_full when allocation fails, which > > releases some spare memory we have for this purpose, and tells the > > user to save the session and exit. > > I'm thinking about how this applies to wisi, when migrating to a module. > > Ada has a built-in allocator; it's probably possible to change that, but > I'd like to understand exactly why we need to do that. We need that to allow the user to save the session while he/she can. > The Ada allocator throws an exception on allocation fail; is it > sufficient to turn that exception into an elisp signal, and arrange for > elisp to call memory_full (or take some other action, like killing the > parser)? What is a "lisp signal" in this context? > Another possible reason to change the Ada allocator is if we want to > expose Ada memory pointers directly to elisp, as Yuan Fu wants to do for > tree-sitter (I don't plan to do this for wisi). Does that require that > the pointers be allocated by the same allocator? Same allocator as what? > I'm not clear what that would mean for the garbage collector; is it > then expected to recover the tree-sitter-allocated memory for the > tree? or does it ignore those lisp objects? It depends on which Lisp object you wrap those pointers in. User-pointer objects allow you to provide your own "finalizer" function. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-25 19:09 ` Eli Zaretskii @ 2021-07-26 5:10 ` Stephen Leake 2021-07-26 12:56 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Stephen Leake @ 2021-07-26 5:10 UTC (permalink / raw) To: Eli Zaretskii; +Cc: casouri, emacs-devel, cpitclaudel, monnier Eli Zaretskii <eliz@gnu.org> writes: >> From: Stephen Leake <stephen_leake@stephe-leake.org> >> Cc: Yuan Fu <casouri@gmail.com>, cpitclaudel@gmail.com, >> monnier@iro.umontreal.ca, emacs-devel@gnu.org >> Date: Sun, 25 Jul 2021 11:01:22 -0700 >> >> >> How do we handle such thing in Emacs? >> > >> > We use xmalloc, which calls memory_full when allocation fails, which >> > releases some spare memory we have for this purpose, and tells the >> > user to save the session and exit. >> >> I'm thinking about how this applies to wisi, when migrating to a module. >> >> Ada has a built-in allocator; it's probably possible to change that, but >> I'd like to understand exactly why we need to do that. > > We need that to allow the user to save the session while he/she can. > >> The Ada allocator throws an exception on allocation fail; is it >> sufficient to turn that exception into an elisp signal, and arrange for >> elisp to call memory_full (or take some other action, like killing the >> parser)? > > What is a "lisp signal" in this context? The module interface layer of wisi.el would do: (signal 'error "parser ran out of memory") >> Another possible reason to change the Ada allocator is if we want to >> expose Ada memory pointers directly to elisp, as Yuan Fu wants to do for >> tree-sitter (I don't plan to do this for wisi). Does that require that >> the pointers be allocated by the same allocator? > > Same allocator as what? As other lisp symbols. >> I'm not clear what that would mean for the garbage collector; is it >> then expected to recover the tree-sitter-allocated memory for the >> tree? or does it ignore those lisp objects? 
> > It depends on which Lisp object you wrap those pointers. User-pointer > object allow you to provide your own "finalizer" function. Ok, that would work. -- -- Stephe ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-26 5:10 ` Stephen Leake @ 2021-07-26 12:56 ` Eli Zaretskii 0 siblings, 0 replies; 370+ messages in thread From: Eli Zaretskii @ 2021-07-26 12:56 UTC (permalink / raw) To: Stephen Leake; +Cc: casouri, emacs-devel, cpitclaudel, monnier > From: Stephen Leake <stephen_leake@stephe-leake.org> > Cc: casouri@gmail.com, cpitclaudel@gmail.com, monnier@iro.umontreal.ca, > emacs-devel@gnu.org > Date: Sun, 25 Jul 2021 22:10:12 -0700 > > >> The Ada allocator throws an exception on allocation fail; is it > >> sufficient to turn that exception into an elisp signal, and arrange for > >> elisp to call memory_full (or take some other action, like killing the > >> parser)? > > > > What is a "lisp signal" in this context? > > The module interface layer of wisi.el would do: > > (signal 'error "parser ran out of memory") We don't have such an error (and handling an error when you've run out of memory could backfire). > >> Another possible reason to change the Ada allocator is if we want to > >> expose Ada memory pointers directly to elisp, as Yuan Fu wants to do for > >> tree-sitter (I don't plan to do this for wisi). Does that require that > >> the pointers be allocated by the same allocator? > > > > Same allocator as what? > > As other lisp symbols. Not sure, perhaps you could free them in a finalizer instead. If you want GC to free them, then yes. ^ permalink raw reply [flat|nested] 370+ messages in thread
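To illustrate the finalizer option mentioned above: the collector owns only the wrapper object, while the finalizer releases the wrapped memory with whatever deallocator matches the library that allocated it, so the two need not share an allocator. The sketch below is a self-contained toy model, not Emacs's garbage collector or module API. In a real module the corresponding entry point is make_user_ptr from emacs-module.h, which takes a finalizer. All names here (struct wrapper, make_wrapper, sweep, free_payload, payloads_freed_count) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model only: hypothetical names, not Emacs's GC or module API.  */

typedef void (*finalizer_fn) (void *);

struct wrapper
{
  void *payload;            /* Memory owned by the foreign library.  */
  finalizer_fn finalizer;   /* How to release it; may be NULL.  */
  bool reachable;           /* Would be set by a mark phase.  */
  struct wrapper *next;     /* Chain of all wrappers the collector knows.  */
};

static struct wrapper *all_wrappers;
static int payloads_freed;

/* Example finalizer.  A real one would call the foreign library's own
   deallocator (for a parser, its delete function) instead of free.  */
void
free_payload (void *payload)
{
  free (payload);
  payloads_freed++;
}

int
payloads_freed_count (void)
{
  return payloads_freed;
}

/* Wrap PAYLOAD and register the wrapper with the collector.
   Returns NULL if the wrapper itself cannot be allocated.  */
struct wrapper *
make_wrapper (void *payload, finalizer_fn finalizer)
{
  struct wrapper *w = malloc (sizeof *w);
  if (w == NULL)
    return NULL;
  w->payload = payload;
  w->finalizer = finalizer;
  w->reachable = true;
  w->next = all_wrappers;
  all_wrappers = w;
  return w;
}

/* Sweep phase: the collector frees each unreachable wrapper itself,
   but leaves the payload to the finalizer.  Returns the number of
   wrappers collected.  */
int
sweep (void)
{
  int collected = 0;
  struct wrapper **link = &all_wrappers;
  while (*link != NULL)
    {
      struct wrapper *w = *link;
      if (w->reachable)
        {
          link = &w->next;
          continue;
        }
      if (w->finalizer != NULL)
        w->finalizer (w->payload);
      *link = w->next;
      free (w);
      collected++;
    }
  return collected;
}
```

In this model a wrapper created without a finalizer would leak its payload on collection, which matches the caveat in the message above: if the collector itself is supposed to free the memory, it has to come from the collector's allocator.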
* Re: How to add pseudo vector types 2021-07-23 20:22 ` Yuan Fu 2021-07-24 6:00 ` Eli Zaretskii @ 2021-07-24 15:04 ` Yuan Fu 2021-07-24 15:48 ` Eli Zaretskii 2021-07-24 16:14 ` Eli Zaretskii 1 sibling, 2 replies; 370+ messages in thread From: Yuan Fu @ 2021-07-24 15:04 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Clément Pit-Claudel, Stefan Monnier, emacs-devel [-- Attachment #1: Type: text/plain, Size: 2070 bytes --] I wrote a simple interface between font-lock and tree-sitter, and it works pretty well: using tree-sitter for fontification, xdisp.c opens a lot faster, and scrolling through the buffer is also perceivably faster. My simple interface works like this: tree-sitter allows you to “pattern match” nodes in the parse tree with a DSL, and assign names to the matched nodes, e.g., given a pattern, you get back a list of (NAME . MATCHED-NODE). And if we use font-lock faces as names for those nodes, we get back a list of (FACE . MATCHED-NODE) from tree-sitter, and Emacs can simply look at the beginning and end of the node, and apply FACE to that range. For flexibility, FACE can also be a function, in which case the function is called with the node. This interface is basically what emacs-tree-sitter does (I don’t know if they allow a capture name to be a function, though.) I have an example major-mode for C that uses tree-sitter for font-locking at the end of tree-sitter.el. Main functions to look at: tree-sitter-query-capture in tree_sitter.c, and tree-sitter-fontify-region-function in tree-sitter.el. On the font-lock front, tree-sitter-fontify-region-function replaces font-lock-default-fontify-region, and tree-sitter-font-lock-settings replaces font-lock-defaults and font-lock-keywords. I should support font-lock-maximum-decoration but haven’t come up with a good way to do that. Maybe I should somehow reuse font-lock-defaults, and make it configurable for tree-sitter font-locking? 
Apart from font-lock-maximum-decoration, what else should tree-sitter share with font-lock? BTW, what is the best way to signal a Lisp error from C? I tried xsignal2, signal_error, error and friends but they seem to crash Emacs. Maybe I wasn’t using them correctly. IIUC if we want tree-sitter to use our malloc, we need to build it with Emacs; where should I put the source of tree-sitter? What’s the difference between make_string and make_pure_c_string? I’ve seen this “pure” thing elsewhere, what does “pure” mean? Yuan [-- Attachment #2: ts.4.patch --] [-- Type: application/octet-stream, Size: 36844 bytes --] From d28e10e5905d244d92b71b74566c0bed80d5ed2b Mon Sep 17 00:00:00 2001 From: Yuan Fu <casouri@gmail.com> Date: Sat, 24 Jul 2021 10:39:15 -0400 Subject: [PATCH] checkpoint 4 - Add font-locking - Remove change-recording from replace_range_2, add to casify_region --- lisp/emacs-lisp/cl-preloaded.el | 2 + lisp/tree-sitter.el | 276 ++++++++++++++++++++++++++ src/casefiddle.c | 12 ++ src/insdel.c | 11 +- src/tree_sitter.c | 332 +++++++++++++++++++++++++++++--- src/tree_sitter.h | 4 +- test/src/tree-sitter-tests.el | 58 +++++- 7 files changed, 655 insertions(+), 40 deletions(-) create mode 100644 lisp/tree-sitter.el diff --git a/lisp/emacs-lisp/cl-preloaded.el b/lisp/emacs-lisp/cl-preloaded.el index 7365e23186..2dccdff91a 100644 --- a/lisp/emacs-lisp/cl-preloaded.el +++ b/lisp/emacs-lisp/cl-preloaded.el @@ -68,6 +68,8 @@ cl--typeof-types (font-spec atom) (font-entity atom) (font-object atom) (vector array sequence atom) (user-ptr atom) + (tree-sitter-parser atom) + (tree-sitter-node atom) ;; Plus, really hand made: (null symbol list sequence atom)) "Alist of supertypes. diff --git a/lisp/tree-sitter.el b/lisp/tree-sitter.el new file mode 100644 index 0000000000..a6ecb09386 --- /dev/null +++ b/lisp/tree-sitter.el @@ -0,0 +1,276 @@ +;;; tree-sitter.el --- tree-sitter utilities -*- lexical-binding: t -*- + +;; Copyright (C) 2021 Free Software Foundation, Inc. 
+ +;; This file is part of GNU Emacs. + +;; GNU Emacs is free software: you can redistribute it and/or modify +;; it under the terms of the GNU General Public License as published by +;; the Free Software Foundation, either version 3 of the License, or +;; (at your option) any later version. + +;; GNU Emacs is distributed in the hope that it will be useful, +;; but WITHOUT ANY WARRANTY; without even the implied warranty of +;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +;; GNU General Public License for more details. + +;; You should have received a copy of the GNU General Public License +;; along with GNU Emacs. If not, see <https://www.gnu.org/licenses/>. + +;;; Commentary: + +;;; Code: + +;;; Node & parser accessors + +(defun tree-sitter-node-buffer (node) + "Return the buffer in where NODE belongs." + (tree-sitter-parser-buffer + (tree-sitter-node-parser node))) + +;;; Parser API supplement + +(defun tree-sitter-get-parser (name) + "Find the first parser with name NAME in `tree-sitter-parser-list'. +Return nil if we can't find any." + (catch 'found + (dolist (parser tree-sitter-parser-list) + (when (equal name (tree-sitter-parser-name parser)) + (throw 'found parser))))) + +(defun tree-sitter-get-parser-create (name language) + "Find the first parser with name NAME in `tree-sitter-parser-list'. +If none exists, create one and return it. LANGUAGE is passed to +`tree-sitter-create-parser' when creating the parser." + (or (tree-sitter-get-parser name) + (tree-sitter-create-parser (current-buffer) language name))) + +;;; Node API supplement + +(defun tree-sitter-node-beginning (node) + "Return the start position of NODE." + (byte-to-position (tree-sitter-node-start-byte node))) + +(defun tree-sitter-node-end (node) + "Return the end position of NODE." + (byte-to-position (tree-sitter-node-end-byte node))) + +(defun tree-sitter-node-in-range (beg end &optional parser-name named) + "Return the smallest node covering BEG to END. 
+Find node in current buffer. Return nil if none find. If NAMED +non-nil, only look for named node. NAMED defaults to nil. By +default, use the first parser in `tree-sitter-parser-list'; but +if PARSER-NAME is non-nil, it specifies the name of the parser that +should be used." + (when-let ((root (tree-sitter-parser-root-node + (if parser-name + (tree-sitter-get-parser parser-name) + (car tree-sitter-parser-list))))) + (tree-sitter-node-descendant-for-byte-range + root (position-bytes beg) (position-bytes end) named))) + +(defun tree-sitter-filter-child (node pred &optional named) + "Return children of NODE that satisfies PRED. +PRED is a function that takes one argument, the child node. If +NAMED non-nil, only search named node. NAMED defaults to nil." + (let ((child (tree-sitter-node-child node 0 named)) + result) + (while child + (when (funcall pred child) + (push child result)) + (setq child (tree-sitter-node-next-sibling child named))) + result)) + +(defun tree-sitter-node-content (node) + "Return the buffer content corresponding to NODE." + (with-current-buffer (tree-sitter-node-buffer node) + (buffer-substring-no-properties + (tree-sitter-node-beginning node) + (tree-sitter-node-end node)))) + +;;; Font-lock + +(defvar-local tree-sitter-font-lock-settings nil + "A list of settings for tree-sitter-based font-locking. + +Each setting controls one parser (often of different language). +A settings is a list of form (NAME LANGUAGE PATTERN). NAME is +the name given to the parser, by convention it is +\"font-lock-<language>\", where <language> is the language that +the parser uses. LANGUAGE is the language object returned by +tree-sitter language dynamic modules. + +PATTERN is a tree-sitter query pattern. (See manual for how to +write query patterns.) This pattern should capture nodes with +either face names or function names. 
If captured with a face +name, the node's corresponding text in the buffer is fontified +with that face; if captured with a function name, the function is +called with three arguments, BEG END NODE, where BEG and END +marks the span of the corresponding text, and NODE is the node +itself.") + +(defun tree-sitter-fontify-region-function (beg end &optional verbose) + "Fontify the region between BEG and END. +If VERBOSE is non-nil, print status messages. +\(See `font-lock-fontify-region-function'.)" + (dolist (elm tree-sitter-font-lock-settings) + (let ((parser-name (car elm)) + (language (nth 1 elm)) + (match-pattern (nth 2 elm))) + (tree-sitter-get-parser-create parser-name language) + (when-let ((node (tree-sitter-node-in-range beg end parser-name))) + (let ((captures (tree-sitter-query-capture + node match-pattern + ;; specifying the range is important. More + ;; often than not, NODE will be the root + ;; node, and if we don't specify the range, + ;; we are basically querying the whole file. + (position-bytes beg) (position-bytes end)))) + (with-silent-modifications + (while captures + (let* ((face (caar captures)) + (node (cdar captures)) + (beg (tree-sitter-node-beginning node)) + (end (tree-sitter-node-end node))) + (cond ((facep face) + (put-text-property beg end 'face face)) + ((functionp face) + (funcall face beg end node))) + + (if verbose + (message "Fontifying text from %d to %d with %s" + beg end face))) + (setq captures (cdr captures)))) + `(jit-lock-bounds ,(tree-sitter-node-beginning node) + . ,(tree-sitter-node-end node))))))) + + +(define-derived-mode json-mode js-mode "JSON" + "Major mode for JSON documents." 
+ (setq-local font-lock-fontify-region-function + #'tree-sitter-fontify-region-function) + (setq-local tree-sitter-font-lock-settings + `(("font-lock-json" + ,(tree-sitter-json) + "(string) @font-lock-string-face +(true) @font-lock-constant-face +(false) @font-lock-constant-face +(null) @font-lock-constant-face")))) + +(defun ts-c-fontify-system-lib (beg end _) + (put-text-property beg (1+ beg) 'face 'font-lock-preprocessor-face) + (put-text-property (1- end) end 'face 'font-lock-preprocessor-face) + (put-text-property (1+ beg) (1- end) + 'face 'font-lock-string-face)) + +(define-derived-mode ts-c-mode prog-mode "TS C" + "C mode with tree-sitter support." + (setq-local font-lock-fontify-region-function + #'tree-sitter-fontify-region-function) + (setq-local tree-sitter-font-lock-settings + `(("font-lock-c" + ,(tree-sitter-c) + "(null) @font-lock-constant-face +(true) @font-lock-constant-face +(false) @font-lock-constant-face + +(comment) @font-lock-comment-face + +(system_lib_string) @ts-c-fontify-system-lib + +(unary_expression + operator: _ @font-lock-negation-char-face) + +(string_literal) @font-lock-string-face +(char_literal) @font-lock-string-face + + + +(function_definition + declarator: (identifier) @font-lock-function-name-face) + +(declaration + declarator: (identifier) @font-lock-function-name-face) + +(function_declarator + declarator: (identifier) @font-lock-function-name-face) + + + +(init_declarator + declarator: (identifier) @font-lock-variable-name-face) + +(parameter_declaration + declarator: (identifier) @font-lock-variable-name-face) + +(preproc_def + name: (identifier) @font-lock-variable-name-face) + +(enumerator + name: (identifier) @font-lock-variable-name-face) + +(field_identifier) @font-lock-variable-name-face + +(parameter_list + (parameter_declaration + (identifier) @font-lock-variable-name-face)) + +(pointer_declarator + declarator: (identifier) @font-lock-variable-name-face) + +(array_declarator + declarator: (identifier) 
@font-lock-variable-name-face) + +(preproc_function_def + name: (identifier) @font-lock-variable-name-face + parameters: (preproc_params + (identifier) @font-lock-variable-name-face)) + + + +(type_identifier) @font-lock-type-face +(primitive_type) @font-lock-type-face + +\"auto\" @font-lock-keyword-face +\"break\" @font-lock-keyword-face +\"case\" @font-lock-keyword-face +\"const\" @font-lock-keyword-face +\"continue\" @font-lock-keyword-face +\"default\" @font-lock-keyword-face +\"do\" @font-lock-keyword-face +\"else\" @font-lock-keyword-face +\"enum\" @font-lock-keyword-face +\"extern\" @font-lock-keyword-face +\"for\" @font-lock-keyword-face +\"goto\" @font-lock-keyword-face +\"if\" @font-lock-keyword-face +\"register\" @font-lock-keyword-face +\"return\" @font-lock-keyword-face +\"sizeof\" @font-lock-keyword-face +\"static\" @font-lock-keyword-face +\"struct\" @font-lock-keyword-face +\"switch\" @font-lock-keyword-face +\"typedef\" @font-lock-keyword-face +\"union\" @font-lock-keyword-face +\"volatile\" @font-lock-keyword-face +\"while\" @font-lock-keyword-face + +\"long\" @font-lock-type-face +\"short\" @font-lock-type-face +\"signed\" @font-lock-type-face +\"unsigned\" @font-lock-type-face + +\"#include\" @font-lock-preprocessor-face +\"#define\" @font-lock-preprocessor-face +\"#ifdef\" @font-lock-preprocessor-face +\"#ifndef\" @font-lock-preprocessor-face +\"#endif\" @font-lock-preprocessor-face +\"#else\" @font-lock-preprocessor-face +\"#elif\" @font-lock-preprocessor-face")))) + +(add-to-list 'auto-mode-alist '("\\.json\\'" . json-mode)) +(add-to-list 'auto-mode-alist '("\\.tsc\\'" . 
ts-c-mode)) + +(provide 'tree-sitter) + +;;; tree-sitter.el ends here diff --git a/src/casefiddle.c b/src/casefiddle.c index a7a2541490..42cd2fdd28 100644 --- a/src/casefiddle.c +++ b/src/casefiddle.c @@ -30,6 +30,10 @@ Copyright (C) 1985, 1994, 1997-1999, 2001-2021 Free Software Foundation, #include "composite.h" #include "keymap.h" +#ifdef HAVE_TREE_SITTER +#include "tree_sitter.h" +#endif + enum case_action {CASE_UP, CASE_DOWN, CASE_CAPITALIZE, CASE_CAPITALIZE_UP}; /* State for casing individual characters. */ @@ -495,6 +499,11 @@ casify_region (enum case_action flag, Lisp_Object b, Lisp_Object e) modify_text (start, end); prepare_casing_context (&ctx, flag, true); +#ifdef HAVE_TREE_SITTER + ptrdiff_t start_byte = CHAR_TO_BYTE (start); + ptrdiff_t old_end_byte = CHAR_TO_BYTE (end); +#endif + ptrdiff_t orig_end = end; record_delete (start, make_buffer_string (start, end, true), false); if (NILP (BVAR (current_buffer, enable_multibyte_characters))) @@ -513,6 +522,9 @@ casify_region (enum case_action flag, Lisp_Object b, Lisp_Object e) { signal_after_change (start, end - start - added, end - start); update_compositions (start, end, CHECK_ALL); +#ifdef HAVE_TREE_SITTER + ts_record_change (start_byte, old_end_byte, CHAR_TO_BYTE (end)); +#endif } return orig_end + added; diff --git a/src/insdel.c b/src/insdel.c index b313c50cda..3dfc281b49 100644 --- a/src/insdel.c +++ b/src/insdel.c @@ -1592,7 +1592,11 @@ replace_range (ptrdiff_t from, ptrdiff_t to, Lisp_Object new, If MARKERS, relocate markers. Unlike most functions at this level, never call - prepare_to_modify_buffer and never call signal_after_change. */ + prepare_to_modify_buffer and never call signal_after_change. + Because this function is called in a loop, one character at a time. + The caller of 'replace_range_2' calls these hooks for the entire + region once. Apart from signal_after_change, any caller of this + function should also call ts_record_change. 
*/ void replace_range_2 (ptrdiff_t from, ptrdiff_t from_byte, @@ -1705,11 +1709,6 @@ replace_range_2 (ptrdiff_t from, ptrdiff_t from_byte, modiff_incr (&MODIFF); CHARS_MODIFF = MODIFF; - -#ifdef HAVE_TREE_SITTER - ts_record_change (from_byte, to_byte, from_byte + insbytes); -#endif - } \f /* Delete characters in current buffer diff --git a/src/tree_sitter.c b/src/tree_sitter.c index a6a8912c84..e9f8ddc7e3 100644 --- a/src/tree_sitter.c +++ b/src/tree_sitter.c @@ -35,6 +35,8 @@ Copyright (C) 2021 Free Software Foundation, Inc. /* parser.h defines a macro ADVANCE that conflicts with alloc.c. */ #include <tree_sitter/parser.h> +/*** Functions related to parser and node object. */ + DEFUN ("tree-sitter-parser-p", Ftree_sitter_parser_p, Stree_sitter_parser_p, 1, 1, 0, doc: /* Return t if OBJECT is a tree-sitter parser. */) @@ -57,6 +59,8 @@ DEFUN ("tree-sitter-node-p", return Qnil; } +/*** Parsing functions */ + /* Update each parser's tree after the user made an edit. This function does not parse the buffer and only updates the tree. (So it should be very fast.) */ @@ -77,7 +81,6 @@ ts_record_change (ptrdiff_t start_byte, ptrdiff_t old_end_byte, XTS_PARSER (lisp_parser)->need_reparse = true; parser_list = Fcdr (parser_list); } - } /* Parse the buffer. We don't parse until we have to. When we have @@ -91,9 +94,19 @@ ts_ensure_parsed (Lisp_Object parser) TSTree *tree = XTS_PARSER(parser)->tree; TSInput input = XTS_PARSER (parser)->input; TSTree *new_tree = ts_parser_parse(ts_parser, tree, input); + /* This should be very rare: it only happens when 1) language is not + set (impossible in Emacs because the user has to supply a + language to create a parser), 2) parse canceled due to timeout + (impossible because we don't set a timeout), 3) parse canceled + due to cancellation flag (impossible because we don't set the + flag). (See comments for ts_parser_parse in + tree_sitter/api.h.) 
*/ + if (new_tree == NULL) + signal_error ("Parse failed", parser); ts_tree_delete (tree); XTS_PARSER (parser)->tree = new_tree; XTS_PARSER (parser)->need_reparse = false; + TSNode node = ts_tree_root_node (new_tree); } /* This is the read function provided to tree-sitter to read from a @@ -103,9 +116,6 @@ ts_ensure_parsed (Lisp_Object parser) ts_read_buffer (void *buffer, uint32_t byte_index, TSPoint position, uint32_t *bytes_read) { - if (!BUFFER_LIVE_P ((struct buffer *) buffer)) - error ("BUFFER is not live"); - ptrdiff_t byte_pos = byte_index + 1; /* Read one character. Tree-sitter wants us to set bytes_read to 0 @@ -114,8 +124,17 @@ ts_read_buffer (void *buffer, uint32_t byte_index, string. */ char *beg; int len; + /* This function could run from a user command, so it is better to + do nothing instead of raising an error. (It was a pain in the a** + to read mega-if-conditions in Emacs source, so I write the two + branches separately, hoping the compiler can merge them.) */ + if (!BUFFER_LIVE_P ((struct buffer *) buffer)) + { + beg = ""; + len = 0; + } // TODO BUF_ZV_BYTE? - if (byte_pos >= BUF_Z_BYTE ((struct buffer *) buffer)) + else if (byte_pos >= BUF_Z_BYTE ((struct buffer *) buffer)) { beg = ""; len = 0; @@ -123,19 +142,23 @@ ts_read_buffer (void *buffer, uint32_t byte_index, else { beg = (char *) BUF_BYTE_ADDRESS (buffer, byte_pos); - len = next_char_len(byte_pos); + len = BYTES_BY_CHAR_HEAD (*(unsigned char *) beg); } *bytes_read = (uint32_t) len; return beg; } +/*** Creators and accessors for parser and node */ + /* Wrap the parser in a Lisp_Object to be used in the Lisp machine. 
*/ Lisp_Object -make_ts_parser (struct buffer *buffer, TSParser *parser, TSTree *tree) +make_ts_parser (struct buffer *buffer, TSParser *parser, + TSTree *tree, Lisp_Object name) { struct Lisp_TS_Parser *lisp_parser - = ALLOCATE_PLAIN_PSEUDOVECTOR (struct Lisp_TS_Parser, PVEC_TS_PARSER); + = ALLOCATE_PSEUDOVECTOR (struct Lisp_TS_Parser, name, PVEC_TS_PARSER); + lisp_parser->name = name; lisp_parser->buffer = buffer; lisp_parser->parser = parser; lisp_parser->tree = tree; @@ -156,17 +179,35 @@ make_ts_node (Lisp_Object parser, TSNode node) return make_lisp_ptr (lisp_node, Lisp_Vectorlike); } +DEFUN ("tree-sitter-node-parser", + Ftree_sitter_node_parser, Stree_sitter_node_parser, + 1, 1, 0, + doc: /* Return the parser to which NODE belongs. */) + (Lisp_Object node) +{ + CHECK_TS_NODE (node); + return XTS_NODE (node)->parser; +} DEFUN ("tree-sitter-create-parser", Ftree_sitter_create_parser, Stree_sitter_create_parser, - 2, 2, 0, + 2, 3, 0, doc: /* Create and return a parser in BUFFER for LANGUAGE. + The parser is automatically added to BUFFER's `tree-sitter-parser-list'. LANGUAGE should be the language provided -by a tree-sitter language dynamic module. */) - (Lisp_Object buffer, Lisp_Object language) +by a tree-sitter language dynamic module. + +NAME (a string) is the name assigned to the parser, like the name for +a process. Unlike process names, not care is taken to make each +parser's name unique. By default, no name is assigned to the parser; +the only consequence of that is you can't use +`tree-sitter-get-parser' to find the parser by its name. */) + (Lisp_Object buffer, Lisp_Object language, Lisp_Object name) { CHECK_BUFFER(buffer); + if (!NILP (name)) + CHECK_STRING (name); /* LANGUAGE is a USER_PTR that contains the pointer to a TSLanguage struct. 
*/ @@ -175,9 +216,8 @@ DEFUN ("tree-sitter-create-parser", ts_parser_set_language (parser, lang); Lisp_Object lisp_parser - = make_ts_parser (XBUFFER(buffer), parser, NULL); + = make_ts_parser (XBUFFER(buffer), parser, NULL, name); - // FIXME: Is this the correct way to set a buffer-local variable? struct buffer *old_buffer = current_buffer; set_buffer_internal (XBUFFER (buffer)); @@ -188,6 +228,30 @@ DEFUN ("tree-sitter-create-parser", return lisp_parser; } +DEFUN ("tree-sitter-parser-buffer", + Ftree_sitter_parser_buffer, Stree_sitter_parser_buffer, + 1, 1, 0, + doc: /* Return the buffer of PARSER. */) + (Lisp_Object parser) +{ + CHECK_TS_PARSER (parser); + Lisp_Object buf; + XSETBUFFER (buf, XTS_PARSER (parser)->buffer); + return buf; +} + +DEFUN ("tree-sitter-parser-name", + Ftree_sitter_parser_name, Stree_sitter_parser_name, + 1, 1, 0, + doc: /* Return parser's name. */) + (Lisp_Object parser) +{ + CHECK_TS_PARSER (parser); + return XTS_PARSER (parser)->name; +} + +/*** Parser API */ + DEFUN ("tree-sitter-parser-root-node", Ftree_sitter_parser_root_node, Stree_sitter_parser_root_node, 1, 1, 0, @@ -200,7 +264,8 @@ DEFUN ("tree-sitter-parser-root-node", return make_ts_node (parser, root_node); } -DEFUN ("tree-sitter-parse", Ftree_sitter_parse, Stree_sitter_parse, +DEFUN ("tree-sitter-parse-string", + Ftree_sitter_parse_string, Stree_sitter_parse_string, 2, 2, 0, doc: /* Parse STRING and return the root node. LANGUAGE should be the language provided by a tree-sitter language @@ -219,23 +284,20 @@ DEFUN ("tree-sitter-parse", Ftree_sitter_parse, Stree_sitter_parse, SSDATA (string), strlen (SSDATA (string))); - /* See comment for ts_parser_parse in tree_sitter/api.h - for possible reasons for a failure. */ + /* See comment in ts_ensure_parsed for possible reasons for a + failure. 
*/ if (tree == NULL) signal_error ("Failed to parse STRING", string); TSNode root_node = ts_tree_root_node (tree); - Lisp_Object lisp_parser = make_ts_parser (NULL, parser, tree); + Lisp_Object lisp_parser = make_ts_parser (NULL, parser, tree, Qnil); Lisp_Object lisp_node = make_ts_node (lisp_parser, root_node); return lisp_node; } -/* Below this point are uninteresting mechanical translations of - tree-sitter API. */ - -/* Node functions. */ +/*** Node API */ DEFUN ("tree-sitter-node-type", Ftree_sitter_node_type, Stree_sitter_node_type, 1, 1, 0, @@ -245,9 +307,31 @@ DEFUN ("tree-sitter-node-type", CHECK_TS_NODE (node); TSNode ts_node = XTS_NODE (node)->node; const char *type = ts_node_type(ts_node); + // TODO: Maybe return a string instead. return intern_c_string (type); } +DEFUN ("tree-sitter-node-start-byte", + Ftree_sitter_node_start_byte, Stree_sitter_node_start_byte, 1, 1, 0, + doc: /* Return the NODE's start byte position. */) + (Lisp_Object node) +{ + CHECK_TS_NODE (node); + TSNode ts_node = XTS_NODE (node)->node; + uint32_t start_byte = ts_node_start_byte(ts_node); + return make_fixnum(start_byte + 1); +} + +DEFUN ("tree-sitter-node-end-byte", + Ftree_sitter_node_end_byte, Stree_sitter_node_end_byte, 1, 1, 0, + doc: /* Return the NODE's end byte position. */) + (Lisp_Object node) +{ + CHECK_TS_NODE (node); + TSNode ts_node = XTS_NODE (node)->node; + uint32_t end_byte = ts_node_end_byte(ts_node); + return make_fixnum(end_byte + 1); +} DEFUN ("tree-sitter-node-string", Ftree_sitter_node_string, Stree_sitter_node_string, 1, 1, 0, @@ -303,9 +387,9 @@ DEFUN ("tree-sitter-node-check", Ftree_sitter_node_check, Stree_sitter_node_check, 2, 2, 0, doc: /* Return non-nil if NODE is in condition COND, nil otherwise. -COND could be 'named, 'missing, 'extra, 'has-error. Named nodes -correspond to named rules in the grammar, whereas "anonymous" nodes -correspond to string literals in the grammar. +COND could be 'named, 'missing, 'extra, 'has-changes, 'has-error. 
+Named nodes correspond to named rules in the grammar, whereas +"anonymous" nodes correspond to string literals in the grammar. Missing nodes are inserted by the parser in order to recover from certain kinds of syntax errors, i.e., should be there but not there. @@ -313,6 +397,9 @@ DEFUN ("tree-sitter-node-check", Extra nodes represent things like comments, which are not required the grammar, but can appear anywhere. +A node "has changes" if the buffer changed since the node is +created. (Don't forget the "s" at the end of 'has-changes.) + A node "has error" if itself is a syntax error or contains any syntax errors. */) (Lisp_Object node, Lisp_Object cond) @@ -329,7 +416,10 @@ DEFUN ("tree-sitter-node-check", result = ts_node_is_extra (ts_node); else if (EQ (cond, Qhas_error)) result = ts_node_has_error (ts_node); + else if (EQ (cond, Qhas_changes)) + result = ts_node_has_changes (ts_node); else + // TODO: Is this a good error message? signal_error ("Expecting one of four symbols, see docstring", cond); return result ? Qt : Qnil; } @@ -432,8 +522,177 @@ DEFUN ("tree-sitter-node-prev-sibling", return make_ts_node(XTS_NODE (node)->parser, sibling); } +DEFUN ("tree-sitter-node-first-child-for-byte", + Ftree_sitter_node_first_child_for_byte, + Stree_sitter_node_first_child_for_byte, 2, 3, 0, + doc: /* Return the first child of NODE on POS. +Specifically, return the first child that extends beyond POS. POS is +a byte position in the buffer counting from 1. Return nil if there +isn't any. If NAMED is non-nil, look for named child only. NAMED +defaults to nil. Note that this function returns an immediate child, +not the smallest (grand)child. 
*/) + (Lisp_Object node, Lisp_Object pos, Lisp_Object named) +{ + CHECK_INTEGER (pos); + + struct buffer *buf = (XTS_PARSER (XTS_NODE (node)->parser)->buffer); + ptrdiff_t byte_pos = XFIXNUM (pos); + + if (byte_pos < BUF_BEGV_BYTE (buf) || byte_pos > BUF_ZV_BYTE (buf)) + xsignal1 (Qargs_out_of_range, pos); + + TSNode ts_node = XTS_NODE (node)->node; + TSNode child; + if (NILP (named)) + child = ts_node_first_child_for_byte (ts_node, byte_pos - 1); + else + child = ts_node_first_named_child_for_byte (ts_node, byte_pos - 1); + + if (ts_node_is_null(child)) + return Qnil; + + return make_ts_node(XTS_NODE (node)->parser, child); +} + +DEFUN ("tree-sitter-node-descendant-for-byte-range", + Ftree_sitter_node_descendant_for_byte_range, + Stree_sitter_node_descendant_for_byte_range, 3, 4, 0, + doc: /* Return the smallest node that covers BEG to END. +The returned node is a descendant of NODE. POS is a byte position +counting from 1. Return nil if there isn't any. If NAMED is non-nil, +look for named child only. NAMED defaults to nil. */) + (Lisp_Object node, Lisp_Object beg, Lisp_Object end, Lisp_Object named) +{ + CHECK_INTEGER (beg); + CHECK_INTEGER (end); + + struct buffer *buf = (XTS_PARSER (XTS_NODE (node)->parser)->buffer); + ptrdiff_t byte_beg = XFIXNUM (beg); + ptrdiff_t byte_end = XFIXNUM (end); + + /* Checks for BUFFER_BEG <= BEG <= END <= BUFFER_END. 
*/ + if (!(BUF_BEGV_BYTE (buf) <= byte_beg + && byte_beg <= byte_end + && byte_end <= BUF_ZV_BYTE (buf))) + xsignal2 (Qargs_out_of_range, beg, end); + + TSNode ts_node = XTS_NODE (node)->node; + TSNode child; + if (NILP (named)) + child = ts_node_descendant_for_byte_range + (ts_node, byte_beg - 1 , byte_end - 1); + else + child = ts_node_named_descendant_for_byte_range + (ts_node, byte_beg - 1, byte_end - 1); + + if (ts_node_is_null(child)) + return Qnil; + + return make_ts_node(XTS_NODE (node)->parser, child); +} + /* Query functions */ +Lisp_Object ts_query_error_to_string (TSQueryError error) +{ + char *error_name; + switch (error) + { + case TSQueryErrorNone: + error_name = "none"; + break; + case TSQueryErrorSyntax: + error_name = "syntax"; + break; + case TSQueryErrorNodeType: + error_name = "node type"; + break; + case TSQueryErrorField: + error_name = "field"; + break; + case TSQueryErrorCapture: + error_name = "capture"; + break; + case TSQueryErrorStructure: + error_name = "structure"; + break; + } + return make_pure_c_string (error_name, strlen(error_name)); +} + +DEFUN ("tree-sitter-query-capture", + Ftree_sitter_query_capture, + Stree_sitter_query_capture, 2, 4, 0, + doc: /* Query NODE with PATTERN. + +Returns a list of (CAPTURE_NAME . NODE). CAPTURE_NAME is the name +assigned to the node in PATTERN. NODE is the captured node. + +PATTERN is a string containing one or more matching patterns. See +manual for further explanation for how to write a match pattern. + +BEG and END, if _both_ non-nil, specifies the range in which the query +is executed. + +Return nil if the query failed. 
*/) + (Lisp_Object node, Lisp_Object pattern, + Lisp_Object beg, Lisp_Object end) +{ + CHECK_TS_NODE (node); + CHECK_STRING (pattern); + + TSNode ts_node = XTS_NODE (node)->node; + Lisp_Object lisp_parser = XTS_NODE (node)->parser; + const TSLanguage *lang = ts_parser_language + (XTS_PARSER (lisp_parser)->parser); + char *source = SSDATA (pattern); + + uint32_t error_offset; + uint32_t error_type; + TSQuery *query = ts_query_new (lang, source, strlen (source), + &error_offset, &error_type); + TSQueryCursor *cursor = ts_query_cursor_new (); + + if (query == NULL) + { + // FIXME: Signal an error? + return Qnil; + } + if (!NILP (beg) && !NILP (end)) + { + EMACS_INT beg_byte = XFIXNUM (beg); + EMACS_INT end_byte = XFIXNUM (end); + ts_query_cursor_set_byte_range + (cursor, (uint32_t) beg_byte - 1, (uint32_t) end_byte - 1); + } + + ts_query_cursor_exec (cursor, query, ts_node); + TSQueryMatch match; + TSQueryCapture capture; + Lisp_Object result = Qnil; + Lisp_Object entry; + Lisp_Object captured_node; + const char *capture_name; + uint32_t capture_name_len; + while (ts_query_cursor_next_match (cursor, &match)) + { + const TSQueryCapture *captures = match.captures; + for (int idx=0; idx < match.capture_count; idx++) + { + capture = captures[idx]; + captured_node = make_ts_node(lisp_parser, capture.node); + capture_name = ts_query_capture_name_for_id + (query, capture.index, &capture_name_len); + entry = Fcons (intern_c_string (capture_name), + captured_node); + result = Fcons (entry, result); + } + } + ts_query_delete (query); + ts_query_cursor_delete (cursor); + return Freverse (result); +} + /* Initialize the tree-sitter routines. 
*/ void syms_of_tree_sitter (void) @@ -443,11 +702,18 @@ syms_of_tree_sitter (void) DEFSYM (Qnamed, "named"); DEFSYM (Qmissing, "missing"); DEFSYM (Qextra, "extra"); + DEFSYM (Qhas_changes, "has-changes"); DEFSYM (Qhas_error, "has-error"); + DEFSYM (Qtree_sitter_query_error, "tree-sitter-query-error"); + Fput (Qtree_sitter_query_error, Qerror_conditions, + pure_list (Qtree_sitter_query_error, Qerror)); + Fput (Qtree_sitter_query_error, Qerror_message, + build_pure_c_string ("Error with query pattern")); + DEFSYM (Qtree_sitter_parser_list, "tree-sitter-parser-list"); - DEFVAR_LISP ("ts-parser-list", Vtree_sitter_parser_list, - doc: /* A list of tree-sitter parsers. + DEFVAR_LISP ("tree-sitter-parser-list", Vtree_sitter_parser_list, + doc: /* A list of tree-sitter parsers. // TODO: more doc. If you removed a parser from this list, do not put it back in. */); Vtree_sitter_parser_list = Qnil; @@ -455,11 +721,19 @@ syms_of_tree_sitter (void) defsubr (&Stree_sitter_parser_p); defsubr (&Stree_sitter_node_p); + + defsubr (&Stree_sitter_node_parser); + defsubr (&Stree_sitter_create_parser); + defsubr (&Stree_sitter_parser_buffer); + defsubr (&Stree_sitter_parser_name); + defsubr (&Stree_sitter_parser_root_node); - defsubr (&Stree_sitter_parse); + defsubr (&Stree_sitter_parse_string); defsubr (&Stree_sitter_node_type); + defsubr (&Stree_sitter_node_start_byte); + defsubr (&Stree_sitter_node_end_byte); defsubr (&Stree_sitter_node_string); defsubr (&Stree_sitter_node_parent); defsubr (&Stree_sitter_node_child); @@ -469,4 +743,8 @@ syms_of_tree_sitter (void) defsubr (&Stree_sitter_node_child_by_field_name); defsubr (&Stree_sitter_node_next_sibling); defsubr (&Stree_sitter_node_prev_sibling); + defsubr (&Stree_sitter_node_first_child_for_byte); + defsubr (&Stree_sitter_node_descendant_for_byte_range); + + defsubr (&Stree_sitter_query_capture); } diff --git a/src/tree_sitter.h b/src/tree_sitter.h index a7e2a2d670..e9b4a71326 100644 --- a/src/tree_sitter.h +++ b/src/tree_sitter.h 
@@ -33,6 +33,7 @@ #define EMACS_TREE_SITTER_H struct Lisp_TS_Parser { union vectorlike_header header; + Lisp_Object name; struct buffer *buffer; TSParser *parser; TSTree *tree; @@ -95,7 +96,8 @@ ts_record_change (ptrdiff_t start_byte, ptrdiff_t old_end_byte, ptrdiff_t new_end_byte); Lisp_Object -make_ts_parser (struct buffer *buffer, TSParser *parser, TSTree *tree); +make_ts_parser (struct buffer *buffer, TSParser *parser, + TSTree *tree, Lisp_Object name); Lisp_Object make_ts_node (Lisp_Object parser, TSNode node); diff --git a/test/src/tree-sitter-tests.el b/test/src/tree-sitter-tests.el index cb1c464d3a..c61ad678d2 100644 --- a/test/src/tree-sitter-tests.el +++ b/test/src/tree-sitter-tests.el @@ -21,6 +21,7 @@ (require 'ert) (require 'tree-sitter-json) +(require 'tree-sitter) (ert-deftest tree-sitter-basic-parsing () "Test basic parsing routines." @@ -52,12 +53,13 @@ tree-sitter-basic-parsing (ert-deftest tree-sitter-node-api () "Tests for node API." (with-temp-buffer - (insert "[1,2,{\"name\": \"Bob\"},3]") (let (parser root-node doc-node object-node pair-node) - (setq parser (tree-sitter-create-parser - (current-buffer) (tree-sitter-json))) - (setq root-node (tree-sitter-parser-root-node - parser)) + (progn + (insert "[1,2,{\"name\": \"Bob\"},3]") + (setq parser (tree-sitter-create-parser + (current-buffer) (tree-sitter-json))) + (setq root-node (tree-sitter-parser-root-node + parser))) ;; `tree-sitter-node-type'. (should (eq 'document (tree-sitter-node-type root-node))) ;; `tree-sitter-node-check'. @@ -100,7 +102,51 @@ tree-sitter-node-api (should (equal "(\",\")" (tree-sitter-node-string (tree-sitter-node-prev-sibling object-node)))) - ))) + ;; `tree-sitter-node-first-child-for-byte'. + (should (equal "(number)" + (tree-sitter-node-string + (tree-sitter-node-first-child-for-byte + doc-node 3 t)))) + (should (equal "(\",\")" + (tree-sitter-node-string + (tree-sitter-node-first-child-for-byte + doc-node 3)))) + ;; `tree-sitter-node-descendant-for-byte-range'. 
+ (should (equal "(\"{\")" + (tree-sitter-node-string + (tree-sitter-node-descendant-for-byte-range + root-node 6 7)))) + (should (equal "(object (pair key: (string (string_content)) value: (string (string_content))))" + (tree-sitter-node-string + (tree-sitter-node-descendant-for-byte-range + root-node 6 7 t))))))) + +(ert-deftest tree-sitter-query-api () + "Tests for query API." + (with-temp-buffer + (let (parser root-node pattern doc-node object-node pair-node) + (progn + (insert "[1,2,{\"name\": \"Bob\"},3]") + (setq parser (tree-sitter-create-parser + (current-buffer) (tree-sitter-json))) + (setq root-node (tree-sitter-parser-root-node + parser)) + (setq pattern "(string) @string +(pair key: (_) @keyword) +(number) @number")) + + (should + (equal + '((number . "1") (number . "2") + (keyword . "\"name\"") + (string . "\"name\"") + (string . "\"Bob\"") + (number . "3")) + (mapcar (lambda (entry) + (cons (car entry) + (tree-sitter-node-content + (cdr entry)))) + (tree-sitter-query-capture root-node pattern))))))) (provide 'tree-sitter-tests) ;;; tree-sitter-tests.el ends here -- 2.24.3 (Apple Git-128) ^ permalink raw reply related [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-24 15:04 ` Yuan Fu @ 2021-07-24 15:48 ` Eli Zaretskii 2021-07-24 17:14 ` Yuan Fu 2021-07-26 14:38 ` Perry E. Metzger 2021-07-24 16:14 ` Eli Zaretskii 1 sibling, 2 replies; 370+ messages in thread From: Eli Zaretskii @ 2021-07-24 15:48 UTC (permalink / raw) To: Yuan Fu; +Cc: cpitclaudel, monnier, emacs-devel > From: Yuan Fu <casouri@gmail.com> > Date: Sat, 24 Jul 2021 11:04:35 -0400 > Cc: Stefan Monnier <monnier@iro.umontreal.ca>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > emacs-devel@gnu.org > > I wrote a simple interface between font-lock and tree-sitter, and it works pretty well: using tree-sitter for fontification, xdisp.c opens a lot faster, and scrolling through the buffer is also perceptibly faster. My simple interface works like this: tree-sitter allows you to “pattern match” nodes in the parse tree with a DSL, and assign names to the matched nodes, e.g., given a pattern, you get back a list of (NAME . MATCHED-NODE). And if we use font-lock faces as names for those nodes, we get back a list of (FACE . MATCHED-NODE) from tree-sitter, and Emacs can simply look at the beginning and end of the node, and apply FACE to that range. For flexibility, FACE can also be a function, in which case the function is called with the node. This interface is basically what emacs-tree-sitter does (I don’t know if they allow a capture name to be a function, though). > > I have an example major-mode for C that uses tree-sitter for font-locking at the end of tree-sitter.el. > > Main functions to look at: tree-sitter-query-capture in tree_sitter.c, and tree-sitter-fontify-region-function in tree-sitter.el. Thanks! > BTW, what is the best way to signal a lisp error from C? I tried xsignal2, signal_error, error and friends but they seem to crash Emacs. Maybe I wasn’t using them correctly. xsignal2 should work, as should xsignal. Please show the code which crashed. 
> IIUC if we want tree-sitter to use our malloc, we need to build it with Emacs. Where should I put the source of tree-sitter? tree-sitter itself should be a library we link against. If you meant the tree-sitter support code, then it should go in a separate file in src/. Or did I misunderstand your question? > What’s the difference between make_string and make_pure_c_string? I’ve seen this “pure” thing elsewhere, what does “pure” mean? I suggest reading the node "Pure Storage" in the ELisp manual. It explains that. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-24 15:48 ` Eli Zaretskii @ 2021-07-24 17:14 ` Yuan Fu 2021-07-24 17:20 ` Eli Zaretskii 2021-07-26 14:38 ` Perry E. Metzger 1 sibling, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-07-24 17:14 UTC (permalink / raw) To: Eli Zaretskii; +Cc: cpitclaudel, Stefan Monnier, emacs-devel [-- Attachment #1: Type: text/plain, Size: 667 bytes --] > >> IIUC if we want tree-sitter to use our malloc, we need to build it with Emacs, where should I put the source of tree-sitter? > > tree-sitter itself should be a library we link against. If you meant > the tree-sitter support code, then it should go on a separate file in > src/. Or did I misunderstand your question? If we link against libtree-sitter, how do we change its malloc behavior? Tree-sitter has these kind of things: #ifndef ts_malloc #define ts_malloc ts_malloc_default #endif So I assume we need to define ts_malloc to, say, xmalloc when compiling libtree-sitter. And if we only link to it, we can’t redefine ts_malloc. Yuan [-- Attachment #2: Type: text/html, Size: 4452 bytes --] ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-24 17:14 ` Yuan Fu @ 2021-07-24 17:20 ` Eli Zaretskii 2021-07-24 17:40 ` Yuan Fu 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-07-24 17:20 UTC (permalink / raw) To: Yuan Fu; +Cc: cpitclaudel, monnier, emacs-devel > From: Yuan Fu <casouri@gmail.com> > Date: Sat, 24 Jul 2021 13:14:50 -0400 > Cc: Stefan Monnier <monnier@iro.umontreal.ca>, > cpitclaudel@gmail.com, > emacs-devel@gnu.org > > IIUC if we want tree-sitter to use our malloc, we need to build it with Emacs, where should I put the > source of tree-sitter? > > tree-sitter itself should be a library we link against. If you meant > the tree-sitter support code, then it should go on a separate file in > src/. Or did I misunderstand your question? > > If we link against libtree-sitter, how do we change its malloc behavior? Tree-sitter has these kind of things: > > #ifndef ts_malloc > #define ts_malloc ts_malloc_default > #endif > > So I assume we need to define ts_malloc to, say, xmalloc when compiling libtree-sitter. And if we only link to > it, we can’t redefine ts_malloc. How does TS propose the client projects to do that? Are you saying that the only way to replace its malloc is to recompile tree-sitter?? ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-24 17:20 ` Eli Zaretskii @ 2021-07-24 17:40 ` Yuan Fu 2021-07-24 17:46 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-07-24 17:40 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Clément Pit-Claudel, Stefan Monnier, emacs-devel [-- Attachment #1: Type: text/plain, Size: 1840 bytes --] > On Jul 24, 2021, at 1:20 PM, Eli Zaretskii <eliz@gnu.org> wrote: > >> From: Yuan Fu <casouri@gmail.com> >> Date: Sat, 24 Jul 2021 13:14:50 -0400 >> Cc: Stefan Monnier <monnier@iro.umontreal.ca>, >> cpitclaudel@gmail.com, >> emacs-devel@gnu.org >> >> IIUC if we want tree-sitter to use our malloc, we need to build it with Emacs, where should I put the >> source of tree-sitter? >> >> tree-sitter itself should be a library we link against. If you meant >> the tree-sitter support code, then it should go on a separate file in >> src/. Or did I misunderstand your question? >> >> If we link against libtree-sitter, how do we change its malloc behavior? Tree-sitter has these kind of things: >> >> #ifndef ts_malloc >> #define ts_malloc ts_malloc_default >> #endif >> >> So I assume we need to define ts_malloc to, say, xmalloc when compiling libtree-sitter. And if we only link to >> it, we can’t redefine ts_malloc. > > How does TS propose the client projects to do that? Are you saying > that the only way to replace its malloc is to recompile tree-sitter?? Here are the relevant lines in alloc.h in tree-sitter: // Allow clients to override allocation functions #ifndef ts_malloc #define ts_malloc ts_malloc_default #endif #ifndef ts_calloc #define ts_calloc ts_calloc_default #endif #ifndef ts_realloc #define ts_realloc ts_realloc_default #endif #ifndef ts_free #define ts_free ts_free_default #endif I’m not a C expert, does this allow us to replace its malloc at runtime? 
Related discussion on the issue tracker: https://github.com/tree-sitter/tree-sitter/issues/739 Yuan [-- Attachment #2: Type: text/html, Size: 4793 bytes --] ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-24 17:40 ` Yuan Fu @ 2021-07-24 17:46 ` Eli Zaretskii 2021-07-24 18:06 ` Yuan Fu 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-07-24 17:46 UTC (permalink / raw) To: Yuan Fu; +Cc: cpitclaudel, monnier, emacs-devel > From: Yuan Fu <casouri@gmail.com> > Date: Sat, 24 Jul 2021 13:40:28 -0400 > Cc: Stefan Monnier <monnier@iro.umontreal.ca>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > emacs-devel@gnu.org > > How does TS propose the client projects to do that? Are you saying > that the only way to replace its malloc is to recompile tree-sitter?? > > Here is the relevant lines in alloc.h in tree-sitter: > > // Allow clients to override allocation functions > > #ifndef ts_malloc > #define ts_malloc ts_malloc_default > #endif > #ifndef ts_calloc > #define ts_calloc ts_calloc_default > #endif > #ifndef ts_realloc > #define ts_realloc ts_realloc_default > #endif > #ifndef ts_free > #define ts_free ts_free_default > #endif > > I’m not a C expert, does this allow us to replace its malloc in runtime? No, not AFAIU. It only allows to make such changes when TS is compiled. We should ask the TS developers to provide a way of specifying custom memory allocation/release function as part of TS initialization. It is a feature many packages provide. ^ permalink raw reply [flat|nested] 370+ messages in thread
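[Editorial illustration] To make the distinction above concrete: an `#ifndef ts_malloc` guard is resolved when the library's own sources are compiled, so a client that merely links against the built library cannot change it. A runtime hook would instead store function pointers that the client can replace during initialization. The following is only a minimal sketch of that second idea with a toy library; `toy_set_allocator`, `toy_alloc`, `toy_release` and the `client_*` functions are hypothetical names invented here, not tree-sitter's API, and `client_malloc` merely stands in for something like xmalloc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical toy library, NOT tree-sitter's real API.  The library
   keeps its allocation functions in pointers instead of macros.  */
typedef void *(*toy_malloc_fn) (size_t);
typedef void (*toy_free_fn) (void *);

static toy_malloc_fn toy_malloc_hook = malloc;
static toy_free_fn toy_free_hook = free;

/* Install custom allocation functions at runtime.  A NULL argument
   keeps the function that is currently installed.  */
void
toy_set_allocator (toy_malloc_fn new_malloc, toy_free_fn new_free)
{
  if (new_malloc)
    toy_malloc_hook = new_malloc;
  if (new_free)
    toy_free_hook = new_free;
}

/* Library-internal code always allocates through the hooks.  */
void *
toy_alloc (size_t size)
{
  return toy_malloc_hook (size);
}

void
toy_release (void *ptr)
{
  toy_free_hook (ptr);
}

/* Example client allocator (a stand-in for xmalloc/xfree): it aborts
   on allocation failure and counts the blocks currently alive.  */
static int client_live_blocks = 0;

void *
client_malloc (size_t size)
{
  void *p = malloc (size);
  assert (p != NULL);
  client_live_blocks++;
  return p;
}

void
client_free (void *ptr)
{
  if (ptr)
    client_live_blocks--;
  free (ptr);
}

int
client_live (void)
{
  return client_live_blocks;
}
```

A client would call `toy_set_allocator (client_malloc, client_free)` once, before the library allocates anything, so that no block is obtained from one allocator and released with another.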
* Re: How to add pseudo vector types 2021-07-24 17:46 ` Eli Zaretskii @ 2021-07-24 18:06 ` Yuan Fu 2021-07-24 18:21 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-07-24 18:06 UTC (permalink / raw) To: Eli Zaretskii; +Cc: cpitclaudel, Stefan Monnier, emacs-devel > We should ask the TS developers to provide a way of specifying custom > memory allocation/release function as part of TS initialization. It > is a feature many packages provide. I commented on tree-sitter’s 1.0 checklist. > Isn't there a better way of updating those than manually take them out > of the TS grammar? Maybe write a short program linked against TS that > would spill them in some format that's convenient to use? Manual > updates are a serious maintenance burden. What would this convenient format look like, in your mind? The grammar definition is already the “source”; I don’t see a way to magically make it easier to work with. What does “manual updates” refer to? If you mean updating patterns like (init_declarator declarator: (identifier) @font-lock-variable-name-face) (parameter_declaration declarator: (identifier) @font-lock-variable-name-face) when a language’s grammar changes, I don’t think we need to update them often, or ever. And it is no harder than updating font-lock-keywords when a language adds a new fancy syntax. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-24 18:06 ` Yuan Fu @ 2021-07-24 18:21 ` Eli Zaretskii 2021-07-24 18:55 ` Stefan Monnier 2021-07-25 18:44 ` Stephen Leake 0 siblings, 2 replies; 370+ messages in thread From: Eli Zaretskii @ 2021-07-24 18:21 UTC (permalink / raw) To: Yuan Fu; +Cc: cpitclaudel, monnier, emacs-devel > From: Yuan Fu <casouri@gmail.com> > Date: Sat, 24 Jul 2021 14:06:52 -0400 > Cc: Stefan Monnier <monnier@iro.umontreal.ca>, > cpitclaudel@gmail.com, > emacs-devel@gnu.org > > > We should ask the TS developers to provide a way of specifying custom > > memory allocation/release function as part of TS initialization. It > > is a feature many packages provide. > > I commented on tree-sitter’s 1.0 checklist. Thanks. > > Isn't there a better way of updating those than manually take them out > > of the TS grammar? Maybe write a short program linked against TS that > > would spill them in some format that's convenient to use? Manual > > updates are a serious maintenance burden. > > How does this convenient format looks like, in your mind? The grammar definition is already the “source”, I don’t see a way to magically make it easier to work with. What does “manual updates” refer to? If you mean updating patterns like > > (init_declarator > declarator: (identifier) @font-lock-variable-name-face) > > (parameter_declaration > declarator: (identifier) @font-lock-variable-name-face) > > when a language’s grammar changes, I don’t think we need to update them often, or ever. And It is not harder than updating font-lock-keywords when a language adds a new fancy syntax. It isn't an immediate problem, so we can delay it for later. However, I do worry about the ability to update this in some non-manual way. 
Take for example the way we update our character databases when Unicode adds more characters/scripts: we use the data files distributed by the Unicode Consortium and process them with scripts in admin/unidata to produce intermediate files in a format convenient for processing by Emacs, then we process those intermediate files as part of building Emacs. Unicode files change maybe once or twice a year, but still, doing all those changes manually would be a burden. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-24 18:21 ` Eli Zaretskii @ 2021-07-24 18:55 ` Stefan Monnier 2021-07-25 18:44 ` Stephen Leake 1 sibling, 0 replies; 370+ messages in thread From: Stefan Monnier @ 2021-07-24 18:55 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Yuan Fu, cpitclaudel, emacs-devel > However, I do worry about the ability to update this in some > non-manual way. It has to be manual to the extent that it's not something that is inherent to the BNF grammar. The rules could accompany the grammar (and TS could give access to them), in which case presumably all editors using TS would end up fontifying in the same way, which would make a fair bit of sense. But in any case, this seems to be a preoccupation that goes much beyond the actual immediate integration of tree-sitter into Emacs, and concerns instead the evolution of tree-sitter itself. Stefan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-24 18:21 ` Eli Zaretskii 2021-07-24 18:55 ` Stefan Monnier @ 2021-07-25 18:44 ` Stephen Leake 1 sibling, 0 replies; 370+ messages in thread From: Stephen Leake @ 2021-07-25 18:44 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Yuan Fu, emacs-devel, cpitclaudel, monnier Eli Zaretskii <eliz@gnu.org> writes: >> From: Yuan Fu <casouri@gmail.com> >> >> when a language’s grammar changes, I don’t think we need to update >> them often, or ever. And It is not harder than updating >> font-lock-keywords when a language adds a new fancy syntax. > > It isn't an immediate problem, so we can delay it for later. > > However, I do worry about the ability to update this in some > non-manual way. Take for example the way we update our character > databases when Unicode adds more characters/scripts: we use the data > files distributed by the Unicode Consortium If the language concerned has some standard definition that is machine readable, then we could get partway there. But no language standard specifies fontification (or indent), so there is no standard machine-readable description of these. tree-sitter provides a de facto standard for specifying fontification (in the highlight rules files); it would make sense for Emacs to be able to read those files directly, along with linking to the corresponding tree-sitter parser. We would have to provide a separate mapping from tree-sitter notation to Emacs font names. wisi provides a mechanism to describe fontification and indentation in the grammar source file; every time ISO releases a new Ada language version, I have to compare them and incorporate the changes. I've written code to partly automate this (the language reference manual contains the grammar in a variant of EBNF in an appendix, which is mostly machine-readable), but it's highly Ada specific, and still mostly a manual process. Fortunately it only happens every 10 years for Ada :). 
Many languages provide some EBNF description of the language, but it is often in a form that is not suitable for whatever parser generator you are using; it is usually optimized for human understanding. I made it a requirement for the wisi parser generator to use the Ada reference grammar as closely as possible, but I still have to modify the grammar to get reasonable performance. You can find many different Java grammars on the web, optimized for different parser generators (there are two different but nominally equivalent grammars in the Java docs). -- -- Stephe ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-24 15:48 ` Eli Zaretskii 2021-07-24 17:14 ` Yuan Fu @ 2021-07-26 14:38 ` Perry E. Metzger 1 sibling, 0 replies; 370+ messages in thread From: Perry E. Metzger @ 2021-07-26 14:38 UTC (permalink / raw) To: emacs-devel On 7/24/21 11:48, Eli Zaretskii wrote: >> From: Yuan Fu <casouri@gmail.com> >> >> >> IIUC if we want tree-sitter to use our malloc, we need to build it with Emacs, where should I put the source of tree-sitter? > tree-sitter itself should be a library we link against. If you meant > the tree-sitter support code, then it should go on a separate file in > src/. Or did I misunderstand your question? > I suspect that the authors' expectations are that enough things need to be tweaked that a given editor project like Emacs probably would want to recompile Tree Sitter for use in their system. I'm not 100% sure about that, but it seems to be what they're assuming. Perry ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-24 15:04 ` Yuan Fu 2021-07-24 15:48 ` Eli Zaretskii @ 2021-07-24 16:14 ` Eli Zaretskii 2021-07-24 17:32 ` Yuan Fu 1 sibling, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-07-24 16:14 UTC (permalink / raw) To: Yuan Fu; +Cc: cpitclaudel, monnier, emacs-devel > From: Yuan Fu <casouri@gmail.com> > Date: Sat, 24 Jul 2021 11:04:35 -0400 > Cc: Stefan Monnier <monnier@iro.umontreal.ca>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > emacs-devel@gnu.org > > +(define-derived-mode ts-c-mode prog-mode "TS C" > + "C mode with tree-sitter support." > + (setq-local font-lock-fontify-region-function > + #'tree-sitter-fontify-region-function) > + (setq-local tree-sitter-font-lock-settings > + `(("font-lock-c" > + ,(tree-sitter-c) > + "(null) @font-lock-constant-face > +(true) @font-lock-constant-face > +(false) @font-lock-constant-face > + > +(comment) @font-lock-comment-face > + > +(system_lib_string) @ts-c-fontify-system-lib > + > +(unary_expression > + operator: _ @font-lock-negation-char-face) > + > +(string_literal) @font-lock-string-face > +(char_literal) @font-lock-string-face Where does this repertoire of possible syntax categories come from? Is this from some list that TS exposes or documents? If so, what happens when the repertoire is modified? > beg = (char *) BUF_BYTE_ADDRESS (buffer, byte_pos); > - len = next_char_len(byte_pos); > + len = BYTES_BY_CHAR_HEAD ((int) beg); The last line is wrong: you need the byte itself. So it should be: len = BYTES_BY_CHAR_HEAD (*beg); ^ permalink raw reply [flat|nested] 370+ messages in thread
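[Editorial illustration] The correction above turns on the difference between a pointer and the byte it points to: the length of a multibyte sequence is encoded in the value of its first byte, so the macro must receive `*beg`, not `beg` cast to an integer. Below is a minimal, self-contained sketch of the same idea. `utf8_len_from_head` and `char_len_at` are hypothetical helpers written for this note; they cover standard 1 to 4 byte UTF-8 only and are a simplified stand-in for BYTES_BY_CHAR_HEAD, whose real definition in Emacs may differ because Emacs uses an extended internal encoding.

```c
#include <assert.h>
#include <stddef.h>

/* Length in bytes of a UTF-8 sequence, judged from its first byte.
   Simplified stand-in for BYTES_BY_CHAR_HEAD (assumption: standard
   UTF-8, 1 to 4 bytes).  Continuation or invalid lead bytes count
   as a single byte.  */
static int
utf8_len_from_head (unsigned char head)
{
  if (head < 0x80)
    return 1;			/* 0xxxxxxx: ASCII */
  if ((head & 0xE0) == 0xC0)
    return 2;			/* 110xxxxx */
  if ((head & 0xF0) == 0xE0)
    return 3;			/* 1110xxxx */
  if ((head & 0xF8) == 0xF0)
    return 4;			/* 11110xxx */
  return 1;
}

/* Length of the character that starts at TEXT[POS].  The lead byte is
   read by dereferencing the pointer; the pointer value itself (an
   address) says nothing about the character.  */
int
char_len_at (const char *text, size_t pos)
{
  assert (text != NULL);
  const char *beg = text + pos;
  return utf8_len_from_head ((unsigned char) *beg);
}
```

Passing `(int) beg` instead would classify the low bits of an address, which is why the result looked arbitrary.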
* Re: How to add pseudo vector types 2021-07-24 16:14 ` Eli Zaretskii @ 2021-07-24 17:32 ` Yuan Fu 2021-07-24 17:42 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-07-24 17:32 UTC (permalink / raw) To: Eli Zaretskii; +Cc: cpitclaudel, monnier, emacs-devel > On Jul 24, 2021, at 12:14 PM, Eli Zaretskii <eliz@gnu.org> wrote: > >> From: Yuan Fu <casouri@gmail.com> >> Date: Sat, 24 Jul 2021 11:04:35 -0400 >> Cc: Stefan Monnier <monnier@iro.umontreal.ca>, >> Clément Pit-Claudel <cpitclaudel@gmail.com>, >> emacs-devel@gnu.org >> >> +(define-derived-mode ts-c-mode prog-mode "TS C" >> + "C mode with tree-sitter support." >> + (setq-local font-lock-fontify-region-function >> + #'tree-sitter-fontify-region-function) >> + (setq-local tree-sitter-font-lock-settings >> + `(("font-lock-c" >> + ,(tree-sitter-c) >> + "(null) @font-lock-constant-face >> +(true) @font-lock-constant-face >> +(false) @font-lock-constant-face >> + >> +(comment) @font-lock-comment-face >> + >> +(system_lib_string) @ts-c-fontify-system-lib >> + >> +(unary_expression >> + operator: _ @font-lock-negation-char-face) >> + >> +(string_literal) @font-lock-string-face >> +(char_literal) @font-lock-string-face > > Where does this repertoire of possible syntax categories come from? > Is this from some list that TS exposes or documents? If so, what > happens when the repertoire is modified? These “syntax categories” are defined by individual language grammar definition for tree-sitter, so it could change from language to language. And tree-sitter does not document them. If these “syntax categories” change, then we need to change our code with them. But I doubt that it will happen often. They are hard to document, because a non-trivial grammar definition often defines hundreds of them; the grammar definition for C has 1000 LOC. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-24 17:32 ` Yuan Fu @ 2021-07-24 17:42 ` Eli Zaretskii 0 siblings, 0 replies; 370+ messages in thread From: Eli Zaretskii @ 2021-07-24 17:42 UTC (permalink / raw) To: Yuan Fu; +Cc: cpitclaudel, monnier, emacs-devel > From: Yuan Fu <casouri@gmail.com> > Date: Sat, 24 Jul 2021 13:32:18 -0400 > Cc: monnier@iro.umontreal.ca, > cpitclaudel@gmail.com, > emacs-devel@gnu.org > > >> +(define-derived-mode ts-c-mode prog-mode "TS C" > >> + "C mode with tree-sitter support." > >> + (setq-local font-lock-fontify-region-function > >> + #'tree-sitter-fontify-region-function) > >> + (setq-local tree-sitter-font-lock-settings > >> + `(("font-lock-c" > >> + ,(tree-sitter-c) > >> + "(null) @font-lock-constant-face > >> +(true) @font-lock-constant-face > >> +(false) @font-lock-constant-face > >> + > >> +(comment) @font-lock-comment-face > >> + > >> +(system_lib_string) @ts-c-fontify-system-lib > >> + > >> +(unary_expression > >> + operator: _ @font-lock-negation-char-face) > >> + > >> +(string_literal) @font-lock-string-face > >> +(char_literal) @font-lock-string-face > > > > Where does this repertoire of possible syntax categories come from? > > Is this from some list that TS exposes or documents? If so, what > > happens when the repertoire is modified? > > These “syntax categories” are defined by individual language grammar definition for tree-sitter, so it could change from language to language. And tree-sitter does not document them. If these “syntax categories” change, then we need to change our code with them. But I doubt that it will happen often. They are hard to document, because a non-trivial grammar definition often defines hundreds of them; the grammar definition for C has 1000 LOC. Isn't there a better way of updating those than manually take them out of the TS grammar? Maybe write a short program linked against TS that would spill them in some format that's convenient to use? Manual updates are a serious maintenance burden. 
^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-22 17:00 ` Eli Zaretskii 2021-07-22 17:47 ` Yuan Fu @ 2021-07-23 14:07 ` Stefan Monnier 2021-07-23 14:45 ` Yuan Fu 2021-07-23 19:13 ` Eli Zaretskii 2021-07-24 9:42 ` Stephen Leake 2 siblings, 2 replies; 370+ messages in thread From: Stefan Monnier @ 2021-07-23 14:07 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Yuan Fu, cpitclaudel, emacs-devel > But that's how the current font-lock and indentation work: they never > look beyond the narrowing limits. Not quite: that's true for indentation, but for font-lock we have `font-lock-dont-widen` (i.e. by default, font-lock widens temporarily while it does its job). For TS, given the cost associated with changing the bounds, I think it would make a lot of sense to ignore narrowing (and maybe provide some separate way to specify bounds, for the rare cases like Info and Rmail where a buffer contains "a collection of things" and we only want to parse/manipulate one of those things at any given time). Stefan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-23 14:07 ` Stefan Monnier @ 2021-07-23 14:45 ` Yuan Fu 2021-07-23 19:13 ` Eli Zaretskii 1 sibling, 0 replies; 370+ messages in thread From: Yuan Fu @ 2021-07-23 14:45 UTC (permalink / raw) To: Stefan Monnier; +Cc: Eli Zaretskii, Clément Pit-Claudel, emacs-devel > On Jul 23, 2021, at 10:07 AM, Stefan Monnier <monnier@iro.umontreal.ca> wrote: > >> But that's how the current font-lock and indentation work: they never >> look beyond the narrowing limits. > > Not quite: that's true for indentation, but for font-lock we have > `font-lock-dont-widen` (i.e. by default, font-lock widens temporarily > while it does its job). > > For TS, given the cost associated with changing the bounds, I think it > would make a lot of sense to ignore narrowing (and maybe provide some > separate way to specify bounds, for the rare cases like Info and Rmail > where a buffer contains "a collection of things" and we only want to > parse/manipulate one of those things at any given time). Tree-sitter lets you set the ranges in which the parser works. That’s how they support multi-language files like html+javascript+css. This will certainly work for Rmail and Info, too. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-23 14:07 ` Stefan Monnier 2021-07-23 14:45 ` Yuan Fu @ 2021-07-23 19:13 ` Eli Zaretskii 2021-07-23 20:28 ` Stefan Monnier 1 sibling, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-07-23 19:13 UTC (permalink / raw) To: Stefan Monnier; +Cc: casouri, cpitclaudel, emacs-devel > From: Stefan Monnier <monnier@iro.umontreal.ca> > Cc: Yuan Fu <casouri@gmail.com>, cpitclaudel@gmail.com, emacs-devel@gnu.org > Date: Fri, 23 Jul 2021 10:07:42 -0400 > > > But that's how the current font-lock and indentation work: they never > > look beyond the narrowing limits. > > Not quite: that's true for indentation, but for font-lock we have > `font-lock-dont-widen` (i.e. by default, font-lock widens temporarily > while it does its job). jit-lock never requests fontifications outside of the accessible portion, because redisplay doesn't look there. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-23 19:13 ` Eli Zaretskii @ 2021-07-23 20:28 ` Stefan Monnier 2021-07-24 6:02 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Stefan Monnier @ 2021-07-23 20:28 UTC (permalink / raw) To: Eli Zaretskii; +Cc: casouri, cpitclaudel, emacs-devel >> > But that's how the current font-lock and indentation work: they never >> > look beyond the narrowing limits. >> Not quite: that's true for indentation, but for font-lock we have >> `font-lock-dont-widen` (i.e. by default, font-lock widens temporarily >> while it does its job). > jit-lock never requests fontifications outside of the accessible > portion, because redisplay doesn't look there. But font-lock may look (and fontify) beyond the narrowing, and when it calls `syntax-ppss` it will usually parse from 1 rather than from `point-min`. I'd expect jit/font-lock running on top of TS to behave similarly: the actual parsing is done over the widened buffer but the fontification is only applied to the visible part (or nearby). Stefan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-23 20:28 ` Stefan Monnier @ 2021-07-24 6:02 ` Eli Zaretskii 2021-07-24 14:19 ` Stefan Monnier 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-07-24 6:02 UTC (permalink / raw) To: Stefan Monnier; +Cc: casouri, cpitclaudel, emacs-devel > From: Stefan Monnier <monnier@iro.umontreal.ca> > Cc: casouri@gmail.com, cpitclaudel@gmail.com, emacs-devel@gnu.org > Date: Fri, 23 Jul 2021 16:28:16 -0400 > > >> > But that's how the current font-lock and indentation work: they never > >> > look beyond the narrowing limits. > >> Not quite: that's true for indentation, but for font-lock we have > >> `font-lock-dont-widen` (i.e. by default, font-lock widens temporarily > >> while it does its job). > > jit-lock never requests fontifications outside of the accessible > > portion, because redisplay doesn't look there. > > But font-lock may look (and fontify) beyond the narrowing, and > when it calls `syntax-ppss` it will usually parse from 1 rather than > from `point-min`. Yes, and that's why I said that callers should call 'widen' if they need to do so. The question is what should TS reading do _by_default_. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-24 6:02 ` Eli Zaretskii @ 2021-07-24 14:19 ` Stefan Monnier 0 siblings, 0 replies; 370+ messages in thread From: Stefan Monnier @ 2021-07-24 14:19 UTC (permalink / raw) To: Eli Zaretskii; +Cc: casouri, cpitclaudel, emacs-devel >> But font-lock may look (and fontify) beyond the narrowing, and >> when it calls `syntax-ppss` it will usually parse from 1 rather than >> from `point-min`. > Yes, and that's why I said that callers should call 'widen' if they > need to do so. > The question is what should TS reading do _by_default_. Ah, then we're in violent agreement. The low-level interface with TS should access the text within the narrowed region. And the code that calls TS will usually want to widen beforehand. Stefan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-22 17:00 ` Eli Zaretskii 2021-07-22 17:47 ` Yuan Fu 2021-07-23 14:07 ` Stefan Monnier @ 2021-07-24 9:42 ` Stephen Leake 2021-07-24 11:22 ` Eli Zaretskii 2 siblings, 1 reply; 370+ messages in thread From: Stephen Leake @ 2021-07-24 9:42 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Yuan Fu, emacs-devel, cpitclaudel, monnier Eli Zaretskii <eliz@gnu.org> writes: >> From: Yuan Fu <casouri@gmail.com> >> >> Yes, I meant to discuss this. The problem with respecting narrowing >> is that, a user can freely narrow and widen arbitrarily, and Emacs >> needs to translate them into insertion & deletion of the buffer text >> for tree-sitter, every time a user narrows or widens the buffer. >> Plus, if tree-sitter respects narrowing, it could happen where a >> user narrows the buffer, the font-locking changes and is not correct >> anymore. Maybe that’s not the user want. Also, if someone narrows >> and widens often, maybe narrow to a function for better focus, >> tree-sitter needs to constantly re-parse most of the buffer. These >> are not significant disadvantages, but what do we get from >> respecting narrowing that justifies code complexity and these small >> annoyances? > > But that's how the current font-lock and indentation work: they never > look beyond the narrowing limits. And that's broken, unless the narrowing is for multi-major-mode. -- -- Stephe ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-24 9:42 ` Stephen Leake @ 2021-07-24 11:22 ` Eli Zaretskii 2021-07-25 18:21 ` Stephen Leake 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-07-24 11:22 UTC (permalink / raw) To: Stephen Leake; +Cc: casouri, emacs-devel, cpitclaudel, monnier > From: Stephen Leake <stephen_leake@stephe-leake.org> > Cc: Yuan Fu <casouri@gmail.com>, cpitclaudel@gmail.com, > monnier@iro.umontreal.ca, emacs-devel@gnu.org > Date: Sat, 24 Jul 2021 02:42:24 -0700 > > > But that's how the current font-lock and indentation work: they never > > look beyond the narrowing limits. > > And that's broken ??? Of course, it isn't: it's how Emacs has worked since v21.1. > unless the narrowing is for multi-major-mode. And what would you do in that case, if you allow TS to look beyond the restriction? ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-24 11:22 ` Eli Zaretskii @ 2021-07-25 18:21 ` Stephen Leake 2021-07-25 19:03 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Stephen Leake @ 2021-07-25 18:21 UTC (permalink / raw) To: Eli Zaretskii; +Cc: casouri, emacs-devel, cpitclaudel, monnier Eli Zaretskii <eliz@gnu.org> writes: >> From: Stephen Leake <stephen_leake@stephe-leake.org> >> Cc: Yuan Fu <casouri@gmail.com>, cpitclaudel@gmail.com, >> monnier@iro.umontreal.ca, emacs-devel@gnu.org >> Date: Sat, 24 Jul 2021 02:42:24 -0700 >> >> > But that's how the current font-lock and indentation work: they never >> > look beyond the narrowing limits. >> >> And that's broken > > ??? Of course, it isn't: it's how Emacs has worked since v21.1. Ada (and other languages, but not all) requires the full file text to properly compute font and indent; narrowing breaks that. The fix for font-lock is font-lock-dont-widen; I implemented a similar mechanism for indent of an ada-mode region in multi-major-mode. In plain ada-mode, indent is currently broken in a narrowed buffer; wisi-indent-region does not widen because it is language-agnostic, and I have not gotten around to implementing a "widen for indent" hook because I don't use narrowing very often, and no one has complained. >> unless the narrowing is for multi-major-mode. > > And what would you do in that case, if you allow TS to look beyond the > restriction? In the multi-major-mode case, there is a separate parser for each language, and each sub-mode region in the text would get its own parser tree (ie, it acts like a separate file), and that parser tree is only told about changes to those regions. So the parser will never try to look outside the region; it doesn't need to know about narrowing. I'll have to upgrade my Ada multi-major-mode implementation to do this for incremental parse. -- -- Stephe ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-25 18:21 ` Stephen Leake @ 2021-07-25 19:03 ` Eli Zaretskii 2021-07-26 16:40 ` Yuan Fu 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-07-25 19:03 UTC (permalink / raw) To: Stephen Leake; +Cc: casouri, emacs-devel, cpitclaudel, monnier > From: Stephen Leake <stephen_leake@stephe-leake.org> > Cc: casouri@gmail.com, cpitclaudel@gmail.com, monnier@iro.umontreal.ca, > emacs-devel@gnu.org > Date: Sun, 25 Jul 2021 11:21:27 -0700 > > >> > But that's how the current font-lock and indentation work: they never > >> > look beyond the narrowing limits. > >> > >> And that's broken > > > > ??? Of course, it isn't: it's how Emacs has worked since v21.1. > > Ada (and other languages, but not all) requires the full file text to > properly compute font and indent; narrowing breaks that. Not relevant: if a major mode's fontification code knows it needs to do that, it will call 'widen'. The issue was what should the TS reader function do. My firm opinion is that it should not look beyond the restriction, because it isn't its business to make those decisions. If the caller needs to widen, it will. > >> unless the narrowing is for multi-major-mode. > > > > And what would you do in that case, if you allow TS to look beyond the > > restriction? > > In the multi-major-mode case, there is a separate parser for each > language, and each sub-mode region in the text would get its own parser > tree (ie, it acts like a separate file), and that parser tree is only > told about changes to those regions. So the parser will never try to > look outside the region; it doesn't need to know about narrowing. Once again, we are talking about the function used by TS to read buffer text. Not about the parser or its caller. Low-level code, which knows nothing about the context, should never look beyond the restriction. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-25 19:03 ` Eli Zaretskii @ 2021-07-26 16:40 ` Yuan Fu 2021-07-26 16:49 ` Eli Zaretskii 2021-07-26 23:40 ` Ergus 0 siblings, 2 replies; 370+ messages in thread From: Yuan Fu @ 2021-07-26 16:40 UTC (permalink / raw) To: Eli Zaretskii Cc: Clément Pit-Claudel, Stephen Leake, Stefan Monnier, emacs-devel > >>>> unless the narrowing is for multi-major-mode. >>> >>> And what would you do in that case, if you allow TS to look beyond the >>> restriction? >> >> In the multi-major-mode case, there is a separate parser for each >> language, and each sub-mode region in the text would get its own parser >> tree (ie, it acts like a separate file), and that parser tree is only >> told about changes to those regions. So the parser will never try to >> look outside the region; it doesn't need to know about narrowing. > > Once again, we are talking about the function used by TS to read > buffer text. Not about the parser or its caller. Low-level code, > which knows nothing about the context, should never look beyond the > restriction. It doesn’t harm for tree-sitter to see the rest of the buffer, it doesn’t modify anything, all it does it reading the text. OTOH, restricting tree-sitter to the bounds of narrows adds complexity for no benefit (as far as I can see). Maybe narrowing is the context that low level code should ignore, or at least tree-sitter should ignore. The only benefit that I can think of is “we firmly adhere to the ‘contract’ that no one can look beyond the narrowed region”, but is it a good contract? Is there really a contract in the first place? IMO, narrowing acts like masking tapes over the rest of the buffer, so that user edits like re-replace wouldn’t spill out. Demanding everything in Emacs to not have access to the rest of the buffer is dogmatic (in the sense that it is too rigid and is simply following the doctrine blindly). 
And about language definitions and font-locking, I just realized that tree-sitter language definitions provide highlighting patterns, and we only need to modify them minimally to use them in Emacs, so there isn’t much manual effort involved. Also, does anyone have thoughts on how tree-sitter should integrate with font-lock beyond the current simple interface? Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-26 16:40 ` Yuan Fu @ 2021-07-26 16:49 ` Eli Zaretskii 2021-07-26 17:09 ` Yuan Fu 2021-07-26 18:32 ` chad 2021-07-26 23:40 ` Ergus 1 sibling, 2 replies; 370+ messages in thread From: Eli Zaretskii @ 2021-07-26 16:49 UTC (permalink / raw) To: Yuan Fu; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel > From: Yuan Fu <casouri@gmail.com> > Date: Mon, 26 Jul 2021 12:40:31 -0400 > Cc: Stephen Leake <stephen_leake@stephe-leake.org>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > Stefan Monnier <monnier@iro.umontreal.ca>, > emacs-devel@gnu.org > > > Once again, we are talking about the function used by TS to read > > buffer text. Not about the parser or its caller. Low-level code, > > which knows nothing about the context, should never look beyond the > > restriction. > > It doesn’t harm for tree-sitter to see the rest of the buffer, it doesn’t modify anything, all it does it reading the text. OTOH, restricting tree-sitter to the bounds of narrows adds complexity for no benefit (as far as I can see). Which complexity does it add? You just compare with BEGV_BYTE instead of BEG_BYTE etc. If we let TS look where it wants, we will lose the ability to restrict it to a certain part of the buffer text. This is needed at least for some specialized modes, and is generally desirable, as it gives Lisp programs an easy way to impose such restrictions whenever they need. > Maybe narrowing is the context that low level code should ignore No other code in Emacs does, and for a good reason. > The only benefit that I can think of is “we firmly adhere to the ‘contract’ that no one can look beyond the narrowed region”, but is it a good contract? Is there really a contract in the first place? It served us very well until now, so yes, I think it's a good contract. > IMO, narrowing acts like masking tapes over the rest of the buffer, so that user edits like re-replace wouldn’t spill out. 
Demanding everything in Emacs to not have access to the rest of the buffer is dogmatic (in the sense that it is too rigid and is simply following the doctrine blindly). Again, this "dogma" is used and adhered to everywhere else in Emacs by such low-level code. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-26 16:49 ` Eli Zaretskii @ 2021-07-26 17:09 ` Yuan Fu 2021-07-26 18:55 ` Eli Zaretskii 2021-07-27 6:13 ` Stephen Leake 2021-07-26 18:32 ` chad 1 sibling, 2 replies; 370+ messages in thread From: Yuan Fu @ 2021-07-26 17:09 UTC (permalink / raw) To: Eli Zaretskii; +Cc: cpitclaudel, Stephen Leake, monnier, emacs-devel >> >>> Once again, we are talking about the function used by TS to read >>> buffer text. Not about the parser or its caller. Low-level code, >>> which knows nothing about the context, should never look beyond the >>> restriction. >> >> It doesn’t harm for tree-sitter to see the rest of the buffer, it doesn’t modify anything, all it does it reading the text. OTOH, restricting tree-sitter to the bounds of narrows adds complexity for no benefit (as far as I can see). > > Which complexity does it add? You just compare with BEGV_BYTE instead > of BEG_BYTE etc. We need to “delete” the hidden text and “re-insert” when we widen the buffer. I’ll try to make it a no-op as long as we remember to widen before calling tree-sitter to parse anything. > > If we let TS look where it wants, we will lose the ability to restrict > it to a certain part of the buffer text. This is needed at least for > some specialized modes, and is generally desirable, as it gives Lisp > programs an easy way to impose such restrictions whenever they need. Tree-sitter lets you set ranges for a parser to limit it self within, in order to support multi-language files. > >> Maybe narrowing is the context that low level code should ignore > > No other code in Emacs does, and for a good reason. > >> The only benefit that I can think of is “we firmly adhere to the ‘contract’ that no one can look beyond the narrowed region”, but is it a good contract? Is there really a contract in the first place? > > It served us very well until now, so yes, I think it's a good > contract. 
> >> IMO, narrowing acts like masking tapes over the rest of the buffer, so that user edits like re-replace wouldn’t spill out. Demanding everything in Emacs to not have access to the rest of the buffer is dogmatic (in the sense that it is too rigid and is simply following the doctrine blindly). > > Again, this "dogma" is used and adhered everywhere else in Emacs by > such low-level code. Ok. I trust you to know better than I do. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-26 17:09 ` Yuan Fu @ 2021-07-26 18:55 ` Eli Zaretskii 2021-07-26 19:06 ` Yuan Fu 2021-07-27 6:13 ` Stephen Leake 1 sibling, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-07-26 18:55 UTC (permalink / raw) To: Yuan Fu; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel > From: Yuan Fu <casouri@gmail.com> > Date: Mon, 26 Jul 2021 13:09:13 -0400 > Cc: Stephen Leake <stephen_leake@stephe-leake.org>, > cpitclaudel@gmail.com, > monnier@iro.umontreal.ca, > emacs-devel@gnu.org > > > Which complexity does it add? You just compare with BEGV_BYTE instead > > of BEG_BYTE etc. > > We need to “delete” the hidden text and “re-insert” when we widen the buffer. I’ll try to make it a no-op as long as we remember to widen before calling tree-sitter to parse anything. If some parser needs access to the whole buffer, its caller should widen the buffer before calling the parser. IOW, the control on which part of the buffer is visible to the parser should be on the level of the caller of the parser, not at the level of the function which accesses buffer text. > > If we let TS look where it wants, we will lose the ability to restrict > > it to a certain part of the buffer text. This is needed at least for > > some specialized modes, and is generally desirable, as it gives Lisp > > programs an easy way to impose such restrictions whenever they need. > > Tree-sitter lets you set ranges for a parser to limit it self within, in order to support multi-language files. That's okay, but why would we want to expose this to Lisp as the means to restrict the accessible portion, when we already have such a means? ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-26 18:55 ` Eli Zaretskii @ 2021-07-26 19:06 ` Yuan Fu 2021-07-26 19:19 ` Perry E. Metzger 2021-07-26 19:20 ` Eli Zaretskii 0 siblings, 2 replies; 370+ messages in thread From: Yuan Fu @ 2021-07-26 19:06 UTC (permalink / raw) To: Eli Zaretskii; +Cc: cpitclaudel, Stephen Leake, monnier, emacs-devel > >>> If we let TS look where it wants, we will lose the ability to restrict >>> it to a certain part of the buffer text. This is needed at least for >>> some specialized modes, and is generally desirable, as it gives Lisp >>> programs an easy way to impose such restrictions whenever they need. >> >> Tree-sitter lets you set ranges for a parser to limit it self within, in order to support multi-language files. > > That's okay, but why would we want to expose this to Lisp as the means > to restrict the accessible portion, when we already have such a means? Tree-sitter lets you set multiple discontinuous ranges, whereas narrowing can only narrow to a single continuous range. Multiple discontinuous ranges are much more useful for HTML+CSS+JS, or PHP + HTML. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
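To make the contrast between a single narrowing and several discontinuous ranges concrete, here is a minimal sketch in plain C. It does not use tree-sitter or Emacs internals: `struct byte_range`, `ranges_valid` and `ranges_contain` are hypothetical names made up for illustration, and the rule that ranges must be ordered and must not overlap is an assumption about what a range-restricted parser would reasonably require.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical half-open byte range [start, end).  This is an
   illustration only, not tree-sitter's own range type.  */
struct byte_range
{
  size_t start;
  size_t end;
};

/* Return true if the N ranges in R are well-formed, sorted by
   position, and non-overlapping (assumed requirement).  */
bool
ranges_valid (const struct byte_range *r, size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      if (r[i].start > r[i].end)
        return false;
      if (i > 0 && r[i - 1].end > r[i].start)
        return false;
    }
  return true;
}

/* Return true if byte offset POS falls inside one of the N ranges.  */
bool
ranges_contain (const struct byte_range *r, size_t n, size_t pos)
{
  assert (ranges_valid (r, n));
  for (size_t i = 0; i < n; i++)
    if (pos >= r[i].start && pos < r[i].end)
      return true;
  return false;
}
```

A narrowing corresponds to the one-element case of this list, while two script blocks embedded in an HTML file would need two entries, which a single narrowing cannot express.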
* Re: How to add pseudo vector types 2021-07-26 19:06 ` Yuan Fu @ 2021-07-26 19:19 ` Perry E. Metzger 2021-07-26 19:31 ` Eli Zaretskii 2021-07-26 19:20 ` Eli Zaretskii 1 sibling, 1 reply; 370+ messages in thread From: Perry E. Metzger @ 2021-07-26 19:19 UTC (permalink / raw) To: emacs-devel On 7/26/21 15:06, Yuan Fu wrote: > Tree-sitter lets you set multiple discontinuous ranges, whereas > narrowing can only narrow to a single continuous range. Multiple > discontinuous range is much more useful for HTML+CSS+JS, or PHP + HML. Other obvious uses: restructured text or markdown documentation amidst code in another language, various sorts of literate programming, etc. (This of course brings up that someday it might be nice to have Emacs aware of such multi-modal text and able to switch how you're editing even inside a single file, but that's a bigger topic.) Perry ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-26 19:19 ` Perry E. Metzger @ 2021-07-26 19:31 ` Eli Zaretskii 0 siblings, 0 replies; 370+ messages in thread From: Eli Zaretskii @ 2021-07-26 19:31 UTC (permalink / raw) To: Perry E. Metzger; +Cc: emacs-devel > Date: Mon, 26 Jul 2021 15:19:20 -0400 > From: "Perry E. Metzger" <perry@piermont.com> > > Other obvious uses: restructured text or markdown documentation amidst > code in another language, various sorts of literate programming, etc. We should, of course, support these features. But their support should be controlled by Lisp programs, not be hard-coded in some low-level C code. The way to access discontinuous ranges of buffer text as a single character sequence needs support in Emacs Lisp before we can map it to the equivalent TS features. > (This of course brings up that someday it might be nice to have Emacs > aware of such multi-modal text and able to switch how you're editing > even inside a single file, but that's a bigger topic.) We have the beginning of this, but have a lot more turf to cover. And currently, what we have uses restrictions. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-26 19:06 ` Yuan Fu 2021-07-26 19:19 ` Perry E. Metzger @ 2021-07-26 19:20 ` Eli Zaretskii 2021-07-26 19:45 ` Yuan Fu 1 sibling, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-07-26 19:20 UTC (permalink / raw) To: Yuan Fu; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel > From: Yuan Fu <casouri@gmail.com> > Date: Mon, 26 Jul 2021 15:06:14 -0400 > Cc: Stephen Leake <stephen_leake@stephe-leake.org>, > cpitclaudel@gmail.com, > monnier@iro.umontreal.ca, > emacs-devel@gnu.org > > > > >>> If we let TS look where it wants, we will lose the ability to restrict > >>> it to a certain part of the buffer text. This is needed at least for > >>> some specialized modes, and is generally desirable, as it gives Lisp > >>> programs an easy way to impose such restrictions whenever they need. > >> > >> Tree-sitter lets you set ranges for a parser to limit it self within, in order to support multi-language files. > > > > That's okay, but why would we want to expose this to Lisp as the means > > to restrict the accessible portion, when we already have such a means? > > Tree-sitter lets you set multiple discontinuous ranges, whereas narrowing can only narrow to a single continuous range. Multiple discontinuous range is much more useful for HTML+CSS+JS, or PHP + HML. I understand. But forcing various Emacs features to use these ranges where a simple restriction will do makes little sense. Last time something like these discontinuous ranges was discussed as a general feature in Emacs, we couldn't come up with an agreed-upon design and implementation. So adding something like that to Emacs is not an easy job. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-26 19:20 ` Eli Zaretskii @ 2021-07-26 19:45 ` Yuan Fu 2021-07-26 19:57 ` Dmitry Gutov 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-07-26 19:45 UTC (permalink / raw) To: Eli Zaretskii; +Cc: cpitclaudel, Stephen Leake, monnier, emacs-devel >>>>> If we let TS look where it wants, we will lose the ability to restrict >>>>> it to a certain part of the buffer text. This is needed at least for >>>>> some specialized modes, and is generally desirable, as it gives Lisp >>>>> programs an easy way to impose such restrictions whenever they need. >>>> >>>> Tree-sitter lets you set ranges for a parser to limit it self within, in order to support multi-language files. >>> >>> That's okay, but why would we want to expose this to Lisp as the means >>> to restrict the accessible portion, when we already have such a means? >> >> Tree-sitter lets you set multiple discontinuous ranges, whereas narrowing can only narrow to a single continuous range. Multiple discontinuous range is much more useful for HTML+CSS+JS, or PHP + HML. > > I understand. But forcing various Emacs features to use these ranges > where a simple restriction will do makes little sense. > > Last time something like these discontinuous ranges was discussed as a > general feature in Emacs, we couldn't come up with an agreed-upon > design and implementation. So adding something like that to Emacs is > not an easy job. We can provide both. Those who needs the more powerful ranges could use that, and those who don’t can use narrowing. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-26 19:45 ` Yuan Fu @ 2021-07-26 19:57 ` Dmitry Gutov 0 siblings, 0 replies; 370+ messages in thread From: Dmitry Gutov @ 2021-07-26 19:57 UTC (permalink / raw) To: Yuan Fu, Eli Zaretskii; +Cc: cpitclaudel, Stephen Leake, monnier, emacs-devel On 26.07.2021 22:45, Yuan Fu wrote: >> Last time something like these discontinuous ranges was discussed as a >> general feature in Emacs, we couldn't come up with an agreed-upon >> design and implementation. So adding something like that to Emacs is >> not an easy job. > We can provide both. Those who needs the more powerful ranges could use that, and those who don’t can use narrowing. If one wanted to continue where the previous discussions stopped, we tentatively decided that the variable prog-indentation-context could help. I.e. when some multiple-major-mode framework wanted to tell the current major mode that there are more "ranges" of the same mode in the buffer, it would bind prog-indentation-context to some particular value. It's very much "to be discussed later", but the second element of prog-indentation-context can be a list of those ranges, or, more likely, a functions that produces that list. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-26 17:09 ` Yuan Fu 2021-07-26 18:55 ` Eli Zaretskii @ 2021-07-27 6:13 ` Stephen Leake 2021-07-27 14:56 ` Yuan Fu 1 sibling, 1 reply; 370+ messages in thread From: Stephen Leake @ 2021-07-27 6:13 UTC (permalink / raw) To: Yuan Fu; +Cc: Eli Zaretskii, emacs-devel, cpitclaudel, monnier Yuan Fu <casouri@gmail.com> writes: >>> >>>> Once again, we are talking about the function used by TS to read >>>> buffer text. Not about the parser or its caller. Low-level code, >>>> which knows nothing about the context, should never look beyond the >>>> restriction. >>> >>> It doesn’t harm for tree-sitter to see the rest of the buffer, it >>> doesn’t modify anything, all it does it reading the text. OTOH, >>> restricting tree-sitter to the bounds of narrows adds complexity >>> for no benefit (as far as I can see). >> >> Which complexity does it add? You just compare with BEGV_BYTE instead >> of BEG_BYTE etc. > > We need to “delete” the hidden text and “re-insert” when we widen the > buffer. I’ll try to make it a no-op as long as we remember to widen > before calling tree-sitter to parse anything. First, the only thing TS deletes is tree nodes, not text; it does not have a copy of the buffer. Why do you think we need to delete the tree nodes corresponding to the hidden text? They provide exactly the context needed to parse the visible text properly. This assumes the narrowing is temporary, not for a multi-major-mode. -- -- Stephe ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-27 6:13 ` Stephen Leake @ 2021-07-27 14:56 ` Yuan Fu 2021-07-28 3:40 ` Stephen Leake 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-07-27 14:56 UTC (permalink / raw) To: Stephen Leake Cc: Eli Zaretskii, Clément Pit-Claudel, monnier, emacs-devel >> >> We need to “delete” the hidden text and “re-insert” when we widen the >> buffer. I’ll try to make it a no-op as long as we remember to widen >> before calling tree-sitter to parse anything. > > First, the only thing TS deletes is tree nodes, not text; it does not > have a copy of the buffer. > > Why do you think we need to delete the tree nodes corresponding to the > hidden text? They provide exactly the context needed to parse the > visible text properly. I don’t think we need to, but I assume that tree-sitter will delete the corresponding nodes if we hide the text from it. For us, the text is there, just hidden; for tree-sitter, the text is deleted. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-27 14:56 ` Yuan Fu @ 2021-07-28 3:40 ` Stephen Leake 2021-07-28 16:36 ` Yuan Fu 0 siblings, 1 reply; 370+ messages in thread From: Stephen Leake @ 2021-07-28 3:40 UTC (permalink / raw) To: Yuan Fu; +Cc: Eli Zaretskii, Clément Pit-Claudel, monnier, emacs-devel Yuan Fu <casouri@gmail.com> writes: >>> >>> We need to “delete” the hidden text and “re-insert” when we widen the >>> buffer. I’ll try to make it a no-op as long as we remember to widen >>> before calling tree-sitter to parse anything. >> >> First, the only thing TS deletes is tree nodes, not text; it does not >> have a copy of the buffer. >> >> Why do you think we need to delete the tree nodes corresponding to the >> hidden text? They provide exactly the context needed to parse the >> visible text properly. > > I don’t think we need to, but I assume that tree-sitter will delete > the corresponding nodes if we hide the text from it. No, tree-sitter only deletes nodes that cover changes. So don't send a change that deletes the hidden text; just send changes in the visible part of the text (that's the only place the user can make changes). tree-sitter will only run the scanner on the change regions, so it will only request text from the visible part of the buffer; all the requests will succeed. > For us, the text is there, just hidden; for tree-sitter, the text is > deleted. No, it simply won't notice that it can't access that part of the buffer, because it will never try. What, exactly, will the buffer-text fetch code do if tree-sitter violates the narrowing (by some error in tree-sitter or user code)? throw an exception? return a null string? -- -- Stephe ^ permalink raw reply [flat|nested] 370+ messages in thread
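The point above, that only edits inside the visible text need to be reported and that narrowing itself produces no edit, can be sketched as follows. This is plain C with hypothetical names (`struct mock_edit`, `make_edit`, `edit_is_visible`, `zv_after_edit`); it only models an edit as three byte offsets and assumes the accessible portion is the half-open byte interval [begv, zv).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical edit record: OLD_LEN bytes starting at START were
   replaced by NEW_LEN bytes.  Stored as three byte offsets.  */
struct mock_edit
{
  size_t start;
  size_t old_end;
  size_t new_end;
};

/* Build an edit record for replacing OLD_LEN bytes at START with
   NEW_LEN bytes.  An insertion has OLD_LEN == 0, a deletion has
   NEW_LEN == 0.  */
struct mock_edit
make_edit (size_t start, size_t old_len, size_t new_len)
{
  struct mock_edit e = { start, start + old_len, start + new_len };
  return e;
}

/* Return true if the replaced text lies entirely inside the
   accessible portion [BEGV, ZV).  */
bool
edit_is_visible (struct mock_edit e, size_t begv, size_t zv)
{
  assert (begv <= zv);
  return begv <= e.start && e.old_end <= zv;
}

/* Return the end of the accessible portion after applying a visible
   edit E: it moves by the change in length.  */
size_t
zv_after_edit (size_t zv, struct mock_edit e)
{
  assert (e.old_end <= zv);
  return zv - (e.old_end - e.start) + (e.new_end - e.start);
}
```

Under this model, narrowing and widening change only `begv` and `zv`; they never create a `mock_edit`, so the parser's existing tree for the hidden text is left alone.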
* Re: How to add pseudo vector types 2021-07-28 3:40 ` Stephen Leake @ 2021-07-28 16:36 ` Yuan Fu 2021-07-28 16:41 ` Eli Zaretskii 2021-07-28 16:43 ` Eli Zaretskii 0 siblings, 2 replies; 370+ messages in thread From: Yuan Fu @ 2021-07-28 16:36 UTC (permalink / raw) To: Stephen Leake Cc: Eli Zaretskii, Clément Pit-Claudel, monnier, emacs-devel > On Jul 27, 2021, at 11:40 PM, Stephen Leake <stephen_leake@stephe-leake.org> wrote: > > Yuan Fu <casouri@gmail.com> writes: > >>>> >>>> We need to “delete” the hidden text and “re-insert” when we widen the >>>> buffer. I’ll try to make it a no-op as long as we remember to widen >>>> before calling tree-sitter to parse anything. >>> >>> First, the only thing TS deletes is tree nodes, not text; it does not >>> have a copy of the buffer. >>> >>> Why do you think we need to delete the tree nodes corresponding to the >>> hidden text? They provide exactly the context needed to parse the >>> visible text properly. >> >> I don’t think we need to, but I assume that tree-sitter will delete >> the corresponding nodes if we hide the text from it. > > No, tree-sitter only deletes nodes that cover changes. > > So don't send a change that deletes the hidden text; just send changes > in the visible part of the text (that's the only place the user can make > changes). tree-sitter will only run the scanner on the change regions, > so it will only request text from the visible part of the buffer; > all the requests will succeed. Then we are not hiding the hidden text from tree-sitter. The implementation you described, IIUC, is essentially do nothing special when the buffer is narrowed. > >> For us, the text is there, just hidden; for tree-sitter, the text is >> deleted. > > No, it simply won't notice that it can't access that part of the buffer, > because it will never try. > > What, exactly, will the buffer-text fetch code do if tree-sitter > violates the narrowing (by some error in tree-sitter or user code)? > throw an exception? 
return a null string? In my current implementation, if tree-sitter accesses buffer content outside of the narrowed region, it reads whitespace; if it accesses buffer content outside of the buffer, it reads a null string. Neither case throws an error. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-28 16:36 ` Yuan Fu @ 2021-07-28 16:41 ` Eli Zaretskii 2021-07-29 22:58 ` Stephen Leake 2021-07-28 16:43 ` Eli Zaretskii 1 sibling, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-07-28 16:41 UTC (permalink / raw) To: Yuan Fu; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel > From: Yuan Fu <casouri@gmail.com> > Date: Wed, 28 Jul 2021 12:36:33 -0400 > Cc: Eli Zaretskii <eliz@gnu.org>, > emacs-devel <emacs-devel@gnu.org>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > monnier@iro.umontreal.ca > > > So don't send a change that deletes the hidden text; just send changes > > in the visible part of the text (that's the only place the user can make > > changes). tree-sitter will only run the scanner on the change regions, > > so it will only request text from the visible part of the buffer; > > all the requests will succeed. > > Then we are not hiding the hidden text from tree-sitter. The implementation you described, IIUC, is essentially do nothing special when the buffer is narrowed. If the TS parser is called while the narrowing is in effect, it will be unable to access text beyond BEGV..ZV. So in that case the narrowing _will_ affect TS. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-28 16:41 ` Eli Zaretskii @ 2021-07-29 22:58 ` Stephen Leake 2021-07-30 6:00 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Stephen Leake @ 2021-07-29 22:58 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Yuan Fu, cpitclaudel, monnier, emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> From: Yuan Fu <casouri@gmail.com> >> Date: Wed, 28 Jul 2021 12:36:33 -0400 >> Cc: Eli Zaretskii <eliz@gnu.org>, >> emacs-devel <emacs-devel@gnu.org>, >> Clément Pit-Claudel <cpitclaudel@gmail.com>, >> monnier@iro.umontreal.ca >> >> > So don't send a change that deletes the hidden text; just send changes >> > in the visible part of the text (that's the only place the user can make >> > changes). tree-sitter will only run the scanner on the change regions, >> > so it will only request text from the visible part of the buffer; >> > all the requests will succeed. >> >> Then we are not hiding the hidden text from tree-sitter. The >> implementation you described, IIUC, is essentially do nothing >> special when the buffer is narrowed. > > If the TS parser is called while the narrowing is in effect, it will > be unable to access text beyond BEGV..ZV. So in that case the > narrowing _will_ affect TS. Please read again; TS is affected in principle, but in practice, in the absence of programming errors, it will never try to access text outside the narrowing, so it won't notice. -- -- Stephe ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-29 22:58 ` Stephen Leake @ 2021-07-30 6:00 ` Eli Zaretskii 0 siblings, 0 replies; 370+ messages in thread From: Eli Zaretskii @ 2021-07-30 6:00 UTC (permalink / raw) To: Stephen Leake; +Cc: casouri, cpitclaudel, monnier, emacs-devel > From: Stephen Leake <stephen_leake@stephe-leake.org> > Cc: Yuan Fu <casouri@gmail.com>, emacs-devel@gnu.org, > cpitclaudel@gmail.com, monnier@iro.umontreal.ca > Date: Thu, 29 Jul 2021 15:58:39 -0700 > > Eli Zaretskii <eliz@gnu.org> writes: > > >> From: Yuan Fu <casouri@gmail.com> > >> Date: Wed, 28 Jul 2021 12:36:33 -0400 > >> Cc: Eli Zaretskii <eliz@gnu.org>, > >> emacs-devel <emacs-devel@gnu.org>, > >> Clément Pit-Claudel <cpitclaudel@gmail.com>, > >> monnier@iro.umontreal.ca > >> > >> > So don't send a change that deletes the hidden text; just send changes > >> > in the visible part of the text (that's the only place the user can make > >> > changes). tree-sitter will only run the scanner on the change regions, > >> > so it will only request text from the visible part of the buffer; > >> > all the requests will succeed. > >> > >> Then we are not hiding the hidden text from tree-sitter. The > >> implementation you described, IIUC, is essentially do nothing > >> special when the buffer is narrowed. > > > > If the TS parser is called while the narrowing is in effect, it will > > be unable to access text beyond BEGV..ZV. So in that case the > > narrowing _will_ affect TS. > > Please read again; TS is affected in principle, but in practice, in the > absence of programming errors, it will never try to access text outside > the narrowing, so it won't notice. Sorry, I don't understand what you wanted me to re-read. As the subsequent discussions revealed, Yuan had in mind a scenario where the text outside of the restriction was changed. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-28 16:36 ` Yuan Fu 2021-07-28 16:41 ` Eli Zaretskii @ 2021-07-28 16:43 ` Eli Zaretskii 2021-07-28 17:47 ` Yuan Fu 1 sibling, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-07-28 16:43 UTC (permalink / raw) To: Yuan Fu; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel > From: Yuan Fu <casouri@gmail.com> > Date: Wed, 28 Jul 2021 12:36:33 -0400 > Cc: Eli Zaretskii <eliz@gnu.org>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > monnier@iro.umontreal.ca, emacs-devel <emacs-devel@gnu.org> > > > What, exactly, will the buffer-text fetch code do if tree-sitter > > violates the narrowing (by some error in tree-sitter or user code)? > > throw an exception? return a null string? > > In my current implementation, if tree-sitter access buffer content outside of narrowed region, it reads whitespaces, if it access buffer content outside of the buffer, it reads null string. Neither case throws an error. What does TS expect the reader function to return when it hits the beginning or end of buffer text? I think we should behave the same when it tries to go beyond the accessible portion. There should be no difference between going beyond the restriction and going beyond EOB. ^ permalink raw reply [flat|nested] 370+ messages in thread
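[Editorial sketch] The rule suggested above (reading past the restriction should look exactly like reading past EOB) could be expressed roughly as follows. This is a minimal illustration, not the code from the patch: struct toy_buffer and toy_read are hypothetical names, the buffer is flat text with no gap, and the callback signature is simplified from tree-sitter's real TSInput read function (it drops the payload pointer and the TSPoint argument).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an Emacs buffer: flat text plus the byte
   bounds of the accessible (narrowed) portion, [begv, zv).  */
struct toy_buffer
{
  const char *text; /* whole buffer text, no gap */
  size_t size;      /* total number of bytes in TEXT */
  size_t begv;      /* offset of the first accessible byte */
  size_t zv;        /* offset one past the last accessible byte */
};

/* Simplified read callback.  BYTE_INDEX is relative to the start of
   the accessible portion.  Return a pointer to the remaining
   accessible bytes and store their count in *BYTES_READ.  A count
   of 0 means "no more text", whether the request ran past the
   restriction or past the end of the buffer.  */
static const char *
toy_read (const struct toy_buffer *buf, uint32_t byte_index,
          uint32_t *bytes_read)
{
  assert (buf->begv <= buf->zv && buf->zv <= buf->size);

  size_t pos = buf->begv + byte_index;
  if (pos >= buf->zv)
    {
      /* Beyond the accessible portion: same answer as at EOB.  */
      *bytes_read = 0;
      return "";
    }

  *bytes_read = (uint32_t) (buf->zv - pos);
  return buf->text + pos;
}
```

With the buffer XXAAXX narrowed to AA (begv = 2, zv = 4), a read at index 0 yields the two bytes of AA, and any index of 2 or more yields zero bytes, so the caller cannot tell the restriction from the real end of the text.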
* Re: How to add pseudo vector types 2021-07-28 16:43 ` Eli Zaretskii @ 2021-07-28 17:47 ` Yuan Fu 2021-07-28 17:54 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-07-28 17:47 UTC (permalink / raw) To: Eli Zaretskii; +Cc: cpitclaudel, Stephen Leake, monnier, emacs-devel > On Jul 28, 2021, at 12:43 PM, Eli Zaretskii <eliz@gnu.org> wrote: > >> From: Yuan Fu <casouri@gmail.com> >> Date: Wed, 28 Jul 2021 12:36:33 -0400 >> Cc: Eli Zaretskii <eliz@gnu.org>, >> Clément Pit-Claudel <cpitclaudel@gmail.com>, >> monnier@iro.umontreal.ca, emacs-devel <emacs-devel@gnu.org> >> >>> What, exactly, will the buffer-text fetch code do if tree-sitter >>> violates the narrowing (by some error in tree-sitter or user code)? >>> throw an exception? return a null string? >> >> In my current implementation, if tree-sitter access buffer content outside of narrowed region, it reads whitespaces, if it access buffer content outside of the buffer, it reads null string. Neither case throws an error. > > What does TS expect the reader function to return when it hits the > beginning or end of buffer text? I think we should behave the same > when it tries to go beyond the accessible portion. There should be no > difference between going beyond the restriction and going beyond EOB. > It expect the read function set *read_bytes to 0 when it reached the end of the buffer. Tree-sitter never “hit the beginning of the buffer text” because it doesn’t read backward. I’m pretty sure tree-sitter expects to always be able to read from BOB. Could you describe the desired effect on tree-sitter when the buffer is narrowed? If we just deny accessibility of the hidden region from tree-sitter, tree-sitter is still aware of the hidden text, because it has previously parsed the hidden text and stored the result in the parse tree. My current implementation is to “replace” the hidden region with whitespaces. 
When the buffer is narrowed and tree-sitter is asked to re-parse (by some user command), I tell tree-sitter that the hidden portion of the buffer has changed; then, during the re-parse, tree-sitter re-scans those parts and reads whitespace. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-28 17:47 ` Yuan Fu @ 2021-07-28 17:54 ` Eli Zaretskii 2021-07-28 18:46 ` Yuan Fu 2021-07-29 23:01 ` Stephen Leake 0 siblings, 2 replies; 370+ messages in thread From: Eli Zaretskii @ 2021-07-28 17:54 UTC (permalink / raw) To: Yuan Fu; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel > From: Yuan Fu <casouri@gmail.com> > Date: Wed, 28 Jul 2021 13:47:42 -0400 > Cc: Stephen Leake <stephen_leake@stephe-leake.org>, > cpitclaudel@gmail.com, > monnier@iro.umontreal.ca, > emacs-devel@gnu.org > > Could you describe the desired effect on tree-sitter when the buffer is narrowed? The behavior should be the same as if the text before and after the narrowed region didn't exist. If we just deny accessibility of the hidden region from tree-sitter, tree-sitter is still aware of the hidden text, because it has previously parsed the hidden text and stored the result in the parse tree. The adherence to narrowing is for the use cases where TS is _always_ invoked on the same narrowed region. You seem to be thinking about changes in the narrowing while TS is parsing, or between consecutive re-parsing calls, but I see no interesting/important use cases which would need to do that. And if there are some tricky cases which do need this, the respective Lisp programs will have to deal with the problem. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-28 17:54 ` Eli Zaretskii @ 2021-07-28 18:46 ` Yuan Fu 2021-07-28 19:00 ` Eli Zaretskii ` (2 more replies) 2021-07-29 23:01 ` Stephen Leake 1 sibling, 3 replies; 370+ messages in thread From: Yuan Fu @ 2021-07-28 18:46 UTC (permalink / raw) To: Eli Zaretskii; +Cc: cpitclaudel, Stephen Leake, monnier, emacs-devel > On Jul 28, 2021, at 1:54 PM, Eli Zaretskii <eliz@gnu.org> wrote: > >> From: Yuan Fu <casouri@gmail.com> >> Date: Wed, 28 Jul 2021 13:47:42 -0400 >> Cc: Stephen Leake <stephen_leake@stephe-leake.org>, >> cpitclaudel@gmail.com, >> monnier@iro.umontreal.ca, >> emacs-devel@gnu.org >> >> Could you describe the desired effect on tree-sitter when the buffer is narrowed? > > The behavior should be the same as if the text before and after the > narrowed region didn't exist. > > If we just deny accessibility of the hidden region from tree-sitter, tree-sitter is still aware of the hidden text, because it has previously parsed the hidden text and stored the result in the parse tree. > > The adherence to narrowing is for the use cases where TS is _always_ > invoked on the same narrowed region. You seem to be thinking about > changes in the narrowing while TS is parsing, or between consecutive > re-parsing calls, but I see no interesting/important use cases which > would need to do that. And if there are some tricky cases which do > need this, the respective Lisp programs will have to deal with the > problem. That makes sense. However it bring up a problem. Consider such a buffer: XXAAXX. Say lisp narrows to AA and creates a tree-sitter parser. Then lisp widens the buffer, and user inserts B in front of AA. Now the buffer is XXBAAXX. 
Emacs has two options for conveying this change to the tree-sitter parser: 1) it does not report the change, so tree-sitter still thinks the buffer is AA, and the portion that tree-sitter sees is effectively pushed forward by one character; 2) it tells tree-sitter that the user inserted a character at the beginning, so tree-sitter thinks the buffer is BAA. Which option is correct depends on how lisp later narrows: if lisp narrows to AA, option 1 is correct; if lisp narrows to BAA, option 2 is correct. But how do we know which option is correct before lisp narrows? Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
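[Editorial sketch] The ambiguity in the XXAAXX example can be made concrete by tracking only the byte range the parser is supposed to see. The names below (struct visible_range and the two functions) are hypothetical and exist only for this illustration; nothing here calls tree-sitter.

```c
#include <assert.h>
#include <stddef.h>

/* Byte range [beg, end) of the text a parser is allowed to see.  */
struct visible_range
{
  size_t beg;
  size_t end;
};

/* Option 1: the insertion is NOT reported to the parser.  The text
   the parser knows about is unchanged, so its range slides forward
   by LEN bytes.  Only meaningful for insertions at or before BEG.  */
static struct visible_range
range_after_unreported_insert (struct visible_range r, size_t pos,
                               size_t len)
{
  assert (pos <= r.beg);
  struct visible_range out = { r.beg + len, r.end + len };
  return out;
}

/* Option 2: the insertion IS reported to the parser.  The range
   keeps its start and grows by LEN bytes.  Only meaningful for
   insertions inside the range, boundaries included.  */
static struct visible_range
range_after_reported_insert (struct visible_range r, size_t pos,
                             size_t len)
{
  assert (r.beg <= pos && pos <= r.end);
  struct visible_range out = { r.beg, r.end + len };
  return out;
}
```

For XXAAXX the range is [2, 4). Inserting B at offset 2 satisfies the precondition of both functions: option 1 gives [3, 5), which is AA in XXBAAXX, and option 2 gives [2, 5), which is BAA. An insertion exactly at the boundary is the one case where both answers are defensible, which is the question raised above.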
* Re: How to add pseudo vector types 2021-07-28 18:46 ` Yuan Fu @ 2021-07-28 19:00 ` Eli Zaretskii 2021-07-29 14:35 ` Yuan Fu 2021-07-29 23:06 ` How to add pseudo vector types Stephen Leake 2021-07-30 0:35 ` Richard Stallman 2 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-07-28 19:00 UTC (permalink / raw) To: Yuan Fu; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel > From: Yuan Fu <casouri@gmail.com> > Date: Wed, 28 Jul 2021 14:46:03 -0400 > Cc: Stephen Leake <stephen_leake@stephe-leake.org>, > cpitclaudel@gmail.com, > monnier@iro.umontreal.ca, > emacs-devel@gnu.org > > > The adherence to narrowing is for the use cases where TS is _always_ > > invoked on the same narrowed region. You seem to be thinking about > > changes in the narrowing while TS is parsing, or between consecutive > > re-parsing calls, but I see no interesting/important use cases which > > would need to do that. And if there are some tricky cases which do > > need this, the respective Lisp programs will have to deal with the > > problem. > > That makes sense. However it bring up a problem. Consider such a buffer: XXAAXX. Say lisp narrows to AA and creates a tree-sitter parser. Then lisp widens the buffer, and user inserts B in front of AA. Now the buffer is XXBAAXX. Emacs has two options to convey this change to the tree-sitter parser: 1) it does not, then tree-sitter still thinks the buffer is AA, essentially the portion where tree-sitter sees is pushed forward by one character, 2) it tells tree-sitter the user inserted a character at the beginning, then tree-sitter thinks the buffer is BAA. Which option is correct depends on how does lisp later narrows: if lisp narrows to AA, then option 1 is correct, if lisp narrows to BAA, then option 2 is correct. But how do we know which option is correct before lisp narrows? We don't need to know. 
The Lisp program which needs to handle this situation will have to figure out what is right in that case, "right" in the sense that it produces the desired results after communicating the changes to TS. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-28 19:00 ` Eli Zaretskii @ 2021-07-29 14:35 ` Yuan Fu 2021-07-29 15:28 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-07-29 14:35 UTC (permalink / raw) To: Eli Zaretskii Cc: Clément Pit-Claudel, Stephen Leake, Stefan Monnier, emacs-devel [-- Attachment #1: Type: text/plain, Size: 1728 bytes --] >> >> That makes sense. However it bring up a problem. Consider such a buffer: XXAAXX. Say lisp narrows to AA and creates a tree-sitter parser. Then lisp widens the buffer, and user inserts B in front of AA. Now the buffer is XXBAAXX. Emacs has two options to convey this change to the tree-sitter parser: 1) it does not, then tree-sitter still thinks the buffer is AA, essentially the portion where tree-sitter sees is pushed forward by one character, 2) it tells tree-sitter the user inserted a character at the beginning, then tree-sitter thinks the buffer is BAA. Which option is correct depends on how does lisp later narrows: if lisp narrows to AA, then option 1 is correct, if lisp narrows to BAA, then option 2 is correct. But how do we know which option is correct before lisp narrows? > > We don't need to know. The Lisp program which needs to handle this > situation will have to figure out what is right in that case, "right" > in the sense that it produces the desired results after communicating > the changes to TS. The difficulty is that what tree-sitter sees must be consistent. If Emacs updates tree-sitter with option 1 and lisp later chooses option 2, the content that tree-sitter sees is not consistent. Anyway, I found a way that avoids this issue: the bounds of tree-sitter’s visible region never change, and the next time lisp narrows to a different region, we update tree-sitter’s bounds to match those of the narrowing. Here is the latest patch. If the code is not entirely straightforward, I’m happy to add more comments to explain it. 
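[Editorial sketch] The "fixed visible bounds" idea described above amounts to clipping each buffer change to the parser's visible range before handing it to tree-sitter. The helper below is a standalone illustration with hypothetical names (struct clipped_edit, clamp_pos, clip_edit). Note one deliberate difference: the attached patch applies max to the start and min to the two ends, whereas this sketch clamps all three positions on both sides, which is an assumption made here so that a change lying entirely outside the visible range degenerates to an empty edit.

```c
#include <assert.h>
#include <stddef.h>

/* An edit expressed relative to the parser's visible beginning,
   i.e. in the coordinates tree-sitter itself uses.  */
struct clipped_edit
{
  ptrdiff_t start;
  ptrdiff_t old_end;
  ptrdiff_t new_end;
};

/* Restrict X to the closed interval [LO, HI].  */
static ptrdiff_t
clamp_pos (ptrdiff_t x, ptrdiff_t lo, ptrdiff_t hi)
{
  return x < lo ? lo : (x > hi ? hi : x);
}

/* Clip a buffer change (START, OLD_END, NEW_END in buffer bytes) to
   [VISIBLE_BEG, VISIBLE_END] and translate it so that VISIBLE_BEG
   becomes offset 0.  */
static struct clipped_edit
clip_edit (ptrdiff_t visible_beg, ptrdiff_t visible_end,
           ptrdiff_t start, ptrdiff_t old_end, ptrdiff_t new_end)
{
  assert (visible_beg <= visible_end);
  assert (start <= old_end && start <= new_end);

  struct clipped_edit e;
  e.start = clamp_pos (start, visible_beg, visible_end) - visible_beg;
  e.old_end = clamp_pos (old_end, visible_beg, visible_end) - visible_beg;
  e.new_end = clamp_pos (new_end, visible_beg, visible_end) - visible_beg;
  return e;
}
```

With a visible range of [2, 4], inserting one byte at offset 3 becomes the edit (1, 1, 2), while deleting the byte at offset 0, which is outside the range, becomes (0, 0, 0) and leaves the tree untouched. Moving the visible bounds themselves would then happen separately, just before a re-parse.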
I set up a linux machine and tried to debug the crashing problem, but it didn’t crash. Seems the crash only appears on my Mac... Yuan [-- Attachment #2: ts.5.patch --] [-- Type: application/octet-stream, Size: 23721 bytes --] From 62fc019a7f57119329d53b9b8a3e8b5c1e61b27f Mon Sep 17 00:00:00 2001 From: Yuan Fu <casouri@gmail.com> Date: Wed, 28 Jul 2021 21:08:43 -0400 Subject: [PATCH] checkpoint 5 - Move define_error out of json.c - Add narrowing support --- lisp/tree-sitter.el | 11 +- src/eval.c | 13 ++ src/json.c | 16 --- src/lisp.h | 5 + src/tree_sitter.c | 231 +++++++++++++++++++++++----------- src/tree_sitter.h | 15 ++- test/src/tree-sitter-tests.el | 53 ++++++++ 7 files changed, 251 insertions(+), 93 deletions(-) diff --git a/lisp/tree-sitter.el b/lisp/tree-sitter.el index a6ecb09386..8a887bb406 100644 --- a/lisp/tree-sitter.el +++ b/lisp/tree-sitter.el @@ -102,12 +102,13 @@ tree-sitter-font-lock-settings PATTERN is a tree-sitter query pattern. (See manual for how to write query patterns.) This pattern should capture nodes with -either face names or function names. If captured with a face -name, the node's corresponding text in the buffer is fontified -with that face; if captured with a function name, the function is -called with three arguments, BEG END NODE, where BEG and END +either face symbols or function symbols. If captured with a face +symbol, the node's corresponding text in the buffer is fontified +with that face; if captured with a function symbol, the function +is called with three arguments, BEG END NODE, where BEG and END marks the span of the corresponding text, and NODE is the node -itself.") +itself. If a symbol is both a face and a function, it is treated +as a face.") (defun tree-sitter-fontify-region-function (beg end &optional verbose) "Fontify the region between BEG and END. 
diff --git a/src/eval.c b/src/eval.c index 18faa0b9b1..33c0763f38 100644 --- a/src/eval.c +++ b/src/eval.c @@ -1956,6 +1956,19 @@ signal_error (const char *s, Lisp_Object arg) xsignal (Qerror, Fcons (build_string (s), arg)); } +void +define_error (Lisp_Object name, const char *message, Lisp_Object parent) +{ + eassert (SYMBOLP (name)); + eassert (SYMBOLP (parent)); + Lisp_Object parent_conditions = Fget (parent, Qerror_conditions); + eassert (CONSP (parent_conditions)); + eassert (!NILP (Fmemq (parent, parent_conditions))); + eassert (NILP (Fmemq (name, parent_conditions))); + Fput (name, Qerror_conditions, pure_cons (name, parent_conditions)); + Fput (name, Qerror_message, build_pure_c_string (message)); +} + /* Use this for arithmetic overflow, e.g., when an integer result is too large even for a bignum. */ void diff --git a/src/json.c b/src/json.c index 3f1d27ad7f..ff28143a3c 100644 --- a/src/json.c +++ b/src/json.c @@ -1098,22 +1098,6 @@ DEFUN ("json-parse-buffer", Fjson_parse_buffer, Sjson_parse_buffer, return unbind_to (count, lisp); } -/* Simplified version of 'define-error' that works with pure - objects. */ - -static void -define_error (Lisp_Object name, const char *message, Lisp_Object parent) -{ - eassert (SYMBOLP (name)); - eassert (SYMBOLP (parent)); - Lisp_Object parent_conditions = Fget (parent, Qerror_conditions); - eassert (CONSP (parent_conditions)); - eassert (!NILP (Fmemq (parent, parent_conditions))); - eassert (NILP (Fmemq (name, parent_conditions))); - Fput (name, Qerror_conditions, pure_cons (name, parent_conditions)); - Fput (name, Qerror_message, build_pure_c_string (message)); -} - void syms_of_json (void) { diff --git a/src/lisp.h b/src/lisp.h index e439447283..d30509b61a 100644 --- a/src/lisp.h +++ b/src/lisp.h @@ -5127,6 +5127,11 @@ maybe_gc (void) maybe_garbage_collect (); } +/* Simplified version of 'define-error' that works with pure + objects. 
*/ +void +define_error (Lisp_Object name, const char *message, Lisp_Object parent); + INLINE_HEADER_END #endif /* EMACS_LISP_H */ diff --git a/src/tree_sitter.c b/src/tree_sitter.c index e9f8ddc7e3..5e16df7758 100644 --- a/src/tree_sitter.c +++ b/src/tree_sitter.c @@ -19,17 +19,8 @@ Copyright (C) 2021 Free Software Foundation, Inc. #include <config.h> -#include <sys/types.h> -#include <sys/stat.h> -#include <sys/param.h> -#include <errno.h> -#include <stdio.h> -#include <stdlib.h> -#include <unistd.h> - #include "lisp.h" #include "buffer.h" -#include "coding.h" #include "tree_sitter.h" /* parser.h defines a macro ADVANCE that conflicts with alloc.c. */ @@ -61,6 +52,16 @@ DEFUN ("tree-sitter-node-p", /*** Parsing functions */ +static inline void +ts_tree_edit_1 (TSTree *tree, ptrdiff_t start_byte, + ptrdiff_t old_end_byte, ptrdiff_t new_end_byte) +{ + TSPoint dummy_point = {0, 0}; + TSInputEdit edit = {start_byte, old_end_byte, new_end_byte, + dummy_point, dummy_point, dummy_point}; + ts_tree_edit (tree, &edit); +} + /* Update each parser's tree after the user made an edit. This function does not parse the buffer and only updates the tree. (So it should be very fast.) */ @@ -68,18 +69,38 @@ DEFUN ("tree-sitter-node-p", ts_record_change (ptrdiff_t start_byte, ptrdiff_t old_end_byte, ptrdiff_t new_end_byte) { + eassert(start_byte <= old_end_byte); + eassert(start_byte <= new_end_byte); + Lisp_Object parser_list = Fsymbol_value (Qtree_sitter_parser_list); - TSPoint dummy_point = {0, 0}; - TSInputEdit edit = {start_byte, old_end_byte, new_end_byte, - dummy_point, dummy_point, dummy_point}; + while (!NILP (parser_list)) { Lisp_Object lisp_parser = Fcar (parser_list); TSTree *tree = XTS_PARSER (lisp_parser)->tree; if (tree != NULL) - ts_tree_edit (tree, &edit); - XTS_PARSER (lisp_parser)->need_reparse = true; - parser_list = Fcdr (parser_list); + { + /* We "clip" the change to between visible_beg and + visible_end. 
It is okay if visible_end ends up larger + than BUF_Z, tree-sitter only access buffer text during + re-parse, and we will adjust visible_beg/end before + re-parse. */ + ptrdiff_t visible_beg = XTS_PARSER (lisp_parser)->visible_beg; + ptrdiff_t visible_end = XTS_PARSER (lisp_parser)->visible_end; + + ptrdiff_t visible_start = + max (visible_beg, start_byte) - visible_beg; + ptrdiff_t visible_old_end = + min (visible_end, old_end_byte) - visible_beg; + ptrdiff_t visible_new_end = + min (visible_end, new_end_byte) - visible_beg; + + ts_tree_edit_1 (tree, visible_start, visible_old_end, + visible_new_end); + XTS_PARSER (lisp_parser)->need_reparse = true; + + parser_list = Fcdr (parser_list); + } } } @@ -93,16 +114,67 @@ ts_ensure_parsed (Lisp_Object parser) TSParser *ts_parser = XTS_PARSER (parser)->parser; TSTree *tree = XTS_PARSER(parser)->tree; TSInput input = XTS_PARSER (parser)->input; + struct buffer *buffer = XTS_PARSER (parser)->buffer; + + /* Before we parse, catch up with the narrowing situation. We + change visible_beg and visible_end to match BUF_BEGV_BYTE and + BUF_ZV_BYTE, and inform tree-sitter of the change. */ + ptrdiff_t visible_beg = XTS_PARSER (parser)->visible_beg; + ptrdiff_t visible_end = XTS_PARSER (parser)->visible_end; + /* Before re-parse, we want to move the visible range of tree-sitter + to matched the narrowed range. For example: + Move ________|____|__ + to |____|__________ */ + + /* 1. Make sure visible_beg <= BUF_BEGV_BYTE. */ + if (visible_beg > BUF_BEGV_BYTE (buffer)) + { + /* Tree-sitter sees: insert at the beginning. */ + ts_tree_edit_1 (tree, 0, 0, visible_beg - BUF_BEGV_BYTE (buffer)); + visible_beg = BUF_BEGV_BYTE (buffer); + } + /* 2. Make sure visible_end = BUF_ZV_BYTE. */ + if (visible_end < BUF_ZV_BYTE (buffer)) + { + /* Tree-sitter sees: insert at the end. 
*/ + ts_tree_edit_1 (tree, visible_end - visible_beg, + visible_end - visible_beg, + BUF_ZV_BYTE (buffer) - visible_beg); + visible_end = BUF_ZV_BYTE (buffer); + } + else if (visible_end > BUF_ZV_BYTE (buffer)) + { + /* Tree-sitter sees: delete at the end. */ + ts_tree_edit_1 (tree, BUF_ZV_BYTE (buffer) - visible_beg, + visible_end - visible_beg, + BUF_ZV_BYTE (buffer) - visible_beg); + visible_end = BUF_ZV_BYTE (buffer); + } + /* 3. Make sure visible_beg = BUF_BEGV_BYTE. */ + if (visible_beg < BUF_BEGV_BYTE (buffer)) + { + /* Tree-sitter sees: delete at the beginning. */ + ts_tree_edit_1 (tree, 0, BUF_BEGV_BYTE (buffer) - visible_beg, 0); + visible_beg = BUF_BEGV_BYTE (buffer); + } + XTS_PARSER (parser)->visible_beg = visible_beg; + XTS_PARSER (parser)->visible_end = visible_end; + TSTree *new_tree = ts_parser_parse(ts_parser, tree, input); - /* This should be very rare: it only happens when 1) language is not - set (impossible in Emacs because the user has to supply a - language to create a parser), 2) parse canceled due to timeout - (impossible because we don't set a timeout), 3) parse canceled - due to cancellation flag (impossible because we don't set the - flag). (See comments for ts_parser_parse in + /* This should be very rare (impossible, really): it only happens + when 1) language is not set (impossible in Emacs because the user + has to supply a language to create a parser), 2) parse canceled + due to timeout (impossible because we don't set a timeout), 3) + parse canceled due to cancellation flag (impossible because we + don't set the flag). (See comments for ts_parser_parse in tree_sitter/api.h.) 
*/ if (new_tree == NULL) - signal_error ("Parse failed", parser); + { + Lisp_Object buf; + XSETBUFFER(buf, buffer); + xsignal1 (Qtree_sitter_parse_error, buf); + } + ts_tree_delete (tree); XTS_PARSER (parser)->tree = new_tree; XTS_PARSER (parser)->need_reparse = false; @@ -110,13 +182,18 @@ ts_ensure_parsed (Lisp_Object parser) } /* This is the read function provided to tree-sitter to read from a - buffer. It reads one character at a time and automatically skip + buffer. It reads one character at a time and automatically skips the gap. */ const char* -ts_read_buffer (void *buffer, uint32_t byte_index, +ts_read_buffer (void *parser, uint32_t byte_index, TSPoint position, uint32_t *bytes_read) { - ptrdiff_t byte_pos = byte_index + 1; + struct buffer *buffer = ((struct Lisp_TS_Parser *) parser)->buffer; + ptrdiff_t visible_beg = ((struct Lisp_TS_Parser *) parser)->visible_beg; + ptrdiff_t byte_pos = byte_index + visible_beg; + /* We will make sure visible_beg >= BUF_BEG_BYTE before re-parse (in + ts_ensure_parsed), so byte_pos will never be smaller than + BUF_BEG_BYTE (unless byte_index < 0). */ /* Read one character. Tree-sitter wants us to set bytes_read to 0 if it reads to the end of buffer. It doesn't say what it wants @@ -126,26 +203,26 @@ ts_read_buffer (void *buffer, uint32_t byte_index, int len; /* This function could run from a user command, so it is better to do nothing instead of raising an error. (It was a pain in the a** - to read mega-if-conditions in Emacs source, so I write the two - branches separately, hoping the compiler can merge them.) */ - if (!BUFFER_LIVE_P ((struct buffer *) buffer)) + to decrypt mega-if-conditions in Emacs source, so I wrote the two + branches separately.) */ + if (!BUFFER_LIVE_P (buffer)) { beg = ""; len = 0; } - // TODO BUF_ZV_BYTE? - else if (byte_pos >= BUF_Z_BYTE ((struct buffer *) buffer)) + /* Reached visible end-of-buffer, tell tree-sitter to read no more. 
*/ + else if (byte_pos >= BUF_ZV_BYTE (buffer)) { beg = ""; len = 0; } + /* Normal case, read a character. */ else { beg = (char *) BUF_BYTE_ADDRESS (buffer, byte_pos); - len = BYTES_BY_CHAR_HEAD ((int) beg); + len = BYTES_BY_CHAR_HEAD ((int) *beg); } *bytes_read = (uint32_t) len; - return beg; } @@ -158,13 +235,16 @@ make_ts_parser (struct buffer *buffer, TSParser *parser, { struct Lisp_TS_Parser *lisp_parser = ALLOCATE_PSEUDOVECTOR (struct Lisp_TS_Parser, name, PVEC_TS_PARSER); + lisp_parser->name = name; lisp_parser->buffer = buffer; lisp_parser->parser = parser; lisp_parser->tree = tree; - TSInput input = {buffer, ts_read_buffer, TSInputEncodingUTF8}; + TSInput input = {lisp_parser, ts_read_buffer, TSInputEncodingUTF8}; lisp_parser->input = input; lisp_parser->need_reparse = true; + lisp_parser->visible_beg = BUF_BEGV (buffer); + lisp_parser->visible_end = BUF_ZV (buffer); return make_lisp_ptr (lisp_parser, Lisp_Vectorlike); } @@ -287,7 +367,7 @@ DEFUN ("tree-sitter-parse-string", /* See comment in ts_ensure_parsed for possible reasons for a failure. 
*/ if (tree == NULL) - signal_error ("Failed to parse STRING", string); + xsignal1 (Qtree_sitter_parse_error, string); TSNode root_node = ts_tree_root_node (tree); @@ -535,7 +615,9 @@ DEFUN ("tree-sitter-node-first-child-for-byte", { CHECK_INTEGER (pos); - struct buffer *buf = (XTS_PARSER (XTS_NODE (node)->parser)->buffer); + struct buffer *buf = XTS_PARSER (XTS_NODE (node)->parser)->buffer; + ptrdiff_t visible_beg = + XTS_PARSER (XTS_NODE (node)->parser)->visible_beg; ptrdiff_t byte_pos = XFIXNUM (pos); if (byte_pos < BUF_BEGV_BYTE (buf) || byte_pos > BUF_ZV_BYTE (buf)) @@ -544,9 +626,10 @@ DEFUN ("tree-sitter-node-first-child-for-byte", TSNode ts_node = XTS_NODE (node)->node; TSNode child; if (NILP (named)) - child = ts_node_first_child_for_byte (ts_node, byte_pos - 1); + child = ts_node_first_child_for_byte (ts_node, byte_pos - visible_beg); else - child = ts_node_first_named_child_for_byte (ts_node, byte_pos - 1); + child = ts_node_first_named_child_for_byte + (ts_node, byte_pos - visible_beg); if (ts_node_is_null(child)) return Qnil; @@ -566,7 +649,9 @@ DEFUN ("tree-sitter-node-descendant-for-byte-range", CHECK_INTEGER (beg); CHECK_INTEGER (end); - struct buffer *buf = (XTS_PARSER (XTS_NODE (node)->parser)->buffer); + struct buffer *buf = XTS_PARSER (XTS_NODE (node)->parser)->buffer; + ptrdiff_t visible_beg = + XTS_PARSER (XTS_NODE (node)->parser)->visible_beg; ptrdiff_t byte_beg = XFIXNUM (beg); ptrdiff_t byte_end = XFIXNUM (end); @@ -580,10 +665,10 @@ DEFUN ("tree-sitter-node-descendant-for-byte-range", TSNode child; if (NILP (named)) child = ts_node_descendant_for_byte_range - (ts_node, byte_beg - 1 , byte_end - 1); + (ts_node, byte_beg - visible_beg , byte_end - visible_beg); else child = ts_node_named_descendant_for_byte_range - (ts_node, byte_beg - 1, byte_end - 1); + (ts_node, byte_beg - visible_beg, byte_end - visible_beg); if (ts_node_is_null(child)) return Qnil; @@ -593,31 +678,24 @@ DEFUN ("tree-sitter-node-descendant-for-byte-range", /* Query 
functions */ -Lisp_Object ts_query_error_to_string (TSQueryError error) +char* +ts_query_error_to_string (TSQueryError error) { - char *error_name; switch (error) { case TSQueryErrorNone: - error_name = "none"; - break; + return "none"; case TSQueryErrorSyntax: - error_name = "syntax"; - break; + return "syntax"; case TSQueryErrorNodeType: - error_name = "node type"; - break; + return "node type"; case TSQueryErrorField: - error_name = "field"; - break; + return "field"; case TSQueryErrorCapture: - error_name = "capture"; - break; + return "capture"; case TSQueryErrorStructure: - error_name = "structure"; - break; + return "structure"; } - return make_pure_c_string (error_name, strlen(error_name)); } DEFUN ("tree-sitter-query-capture", @@ -634,7 +712,7 @@ DEFUN ("tree-sitter-query-capture", BEG and END, if _both_ non-nil, specifies the range in which the query is executed. -Return nil if the query failed. */) +Raise an tree-sitter-query-error if PATTERN is malformed. */) (Lisp_Object node, Lisp_Object pattern, Lisp_Object beg, Lisp_Object end) { @@ -643,47 +721,56 @@ DEFUN ("tree-sitter-query-capture", TSNode ts_node = XTS_NODE (node)->node; Lisp_Object lisp_parser = XTS_NODE (node)->parser; + ptrdiff_t visible_beg = + XTS_PARSER (XTS_NODE (node)->parser)->visible_beg; const TSLanguage *lang = ts_parser_language (XTS_PARSER (lisp_parser)->parser); char *source = SSDATA (pattern); + uint32_t error_offset; - uint32_t error_type; + TSQueryError error_type; TSQuery *query = ts_query_new (lang, source, strlen (source), &error_offset, &error_type); TSQueryCursor *cursor = ts_query_cursor_new (); if (query == NULL) { - // FIXME: Signal an error? - return Qnil; + // FIXME: Still crashes, debug when I can get a gdb. 
+ xsignal2 (Qtree_sitter_query_error, + make_fixnum (error_offset), + build_string (ts_query_error_to_string (error_type))); } if (!NILP (beg) && !NILP (end)) { EMACS_INT beg_byte = XFIXNUM (beg); EMACS_INT end_byte = XFIXNUM (end); ts_query_cursor_set_byte_range - (cursor, (uint32_t) beg_byte - 1, (uint32_t) end_byte - 1); + (cursor, (uint32_t) beg_byte - visible_beg, + (uint32_t) end_byte - visible_beg); } ts_query_cursor_exec (cursor, query, ts_node); TSQueryMatch match; - TSQueryCapture capture; + Lisp_Object result = Qnil; - Lisp_Object entry; - Lisp_Object captured_node; - const char *capture_name; - uint32_t capture_name_len; while (ts_query_cursor_next_match (cursor, &match)) { const TSQueryCapture *captures = match.captures; for (int idx=0; idx < match.capture_count; idx++) { + TSQueryCapture capture; + Lisp_Object captured_node; + const char *capture_name; + Lisp_Object entry; + uint32_t capture_name_len; + capture = captures[idx]; captured_node = make_ts_node(lisp_parser, capture.node); capture_name = ts_query_capture_name_for_id (query, capture.index, &capture_name_len); - entry = Fcons (intern_c_string (capture_name), + entry = Fcons (intern_c_string_1 + (capture_name, capture_name_len), captured_node); result = Fcons (entry, result); } @@ -705,11 +792,15 @@ syms_of_tree_sitter (void) DEFSYM (Qhas_changes, "has-changes"); DEFSYM (Qhas_error, "has-error"); + DEFSYM(Qtree_sitter_error, "tree-sitter-error"); DEFSYM (Qtree_sitter_query_error, "tree-sitter-query-error"); - Fput (Qtree_sitter_query_error, Qerror_conditions, - pure_list (Qtree_sitter_query_error, Qerror)); - Fput (Qtree_sitter_query_error, Qerror_message, - build_pure_c_string ("Error with query pattern")) + DEFSYM (Qtree_sitter_parse_error, "tree-sitter-parse-error") + define_error (Qtree_sitter_error, "Generic tree-sitter error", Qerror); + define_error (Qtree_sitter_query_error, "Query pattern is malformed", + Qtree_sitter_error); + define_error (Qtree_sitter_parse_error, "Parse failed", + 
Qtree_sitter_error); + DEFSYM (Qtree_sitter_parser_list, "tree-sitter-parser-list"); DEFVAR_LISP ("tree-sitter-parser-list", Vtree_sitter_parser_list, diff --git a/src/tree_sitter.h b/src/tree_sitter.h index e9b4a71326..7e0fec0ee9 100644 --- a/src/tree_sitter.h +++ b/src/tree_sitter.h @@ -20,8 +20,6 @@ Copyright (C) 2021 Free Software Foundation, Inc. #ifndef EMACS_TREE_SITTER_H #define EMACS_TREE_SITTER_H -#include <sys/types.h> - #include "lisp.h" #include <tree_sitter/api.h> @@ -33,12 +31,25 @@ #define EMACS_TREE_SITTER_H struct Lisp_TS_Parser { union vectorlike_header header; + /* A parser's name is just a convenient tag, see docstring for + 'tree-sitter-make-parser', and 'tree-sitter-get-parser'. */ Lisp_Object name; struct buffer *buffer; TSParser *parser; TSTree *tree; TSInput input; + /* Re-parsing an unchanged buffer is not free for tree-sitter, so we + only make it re-parse when need_reparse == true. That usually + means some change is made in the buffer. But others could set + this field to true to force tree-sitter to re-parse. */ bool need_reparse; + /* These two positions record the byte positions of the "visible + region" that tree-sitter sees. Unlike markers, these two + positions do not change as the user inserts and deletes text + around them. Before re-parse, we move these positions to match + BUF_BEGV_BYTE and BUF_ZV_BYTE. */ + ptrdiff_t visible_beg; + ptrdiff_t visible_end; }; /* A wrapper around a tree-sitter node. */ diff --git a/test/src/tree-sitter-tests.el b/test/src/tree-sitter-tests.el index c61ad678d2..69104568de 100644 --- a/test/src/tree-sitter-tests.el +++ b/test/src/tree-sitter-tests.el @@ -148,5 +148,58 @@ tree-sitter-query-api (cdr entry)))) (tree-sitter-query-capture root-node pattern))))))) +(ert-deftest tree-sitter-narrow () + "Tests if narrowing works." 
+ (with-temp-buffer + (let (parser root-node pattern doc-node object-node pair-node) + (progn + (insert "xxx[1,{\"name\": \"Bob\"},2,3]xxx") + (narrow-to-region (+ (point-min) 3) (- (point-max) 3)) + (setq parser (tree-sitter-create-parser + (current-buffer) (tree-sitter-json))) + (setq root-node (tree-sitter-parser-root-node + parser))) + ;; This test is from the basic test. + (should + (equal + (tree-sitter-node-string + (tree-sitter-parser-root-node parser)) + "(document (array (number) (object (pair key: (string (string_content)) value: (string (string_content)))) (number) (number)))")) + + (widen) + (goto-char (point-min)) + (insert "ooo") + (should (equal "oooxxx[1,{\"name\": \"Bob\"},2,3]xxx" + (buffer-string))) + (delete-region 10 26) + (should (equal "oooxxx[1,2,3]xxx" + (buffer-string))) + (narrow-to-region (+ (point-min) 6) (- (point-max) 3)) + ;; This test is also from the basic test. + (should + (equal (tree-sitter-node-string + (tree-sitter-parser-root-node parser)) + "(document (array (number) (number) (number)))")) + (widen) + (goto-char (point-max)) + (insert "[1,2]") + (should (equal "oooxxx[1,2,3]xxx[1,2]" + (buffer-string))) + (narrow-to-region (- (point-max) 5) (point-max)) + (should + (equal (tree-sitter-node-string + (tree-sitter-parser-root-node parser)) + "(document (array (number) (number)))")) + (widen) + (goto-char (point-min)) + (insert "[1]") + (should (equal "[1]oooxxx[1,2,3]xxx[1,2]" + (buffer-string))) + (narrow-to-region (point-min) (+ (point-min) 3)) + (should + (equal (tree-sitter-node-string + (tree-sitter-parser-root-node parser)) + "(document (array (number)))"))))) + (provide 'tree-sitter-tests) ;;; tree-sitter-tests.el ends here -- 2.24.3 (Apple Git-128) ^ permalink raw reply related [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-29 14:35 ` Yuan Fu @ 2021-07-29 15:28 ` Eli Zaretskii 2021-07-29 15:57 ` Yuan Fu 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-07-29 15:28 UTC (permalink / raw) To: Yuan Fu; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel > From: Yuan Fu <casouri@gmail.com> > Date: Thu, 29 Jul 2021 10:35:10 -0400 > Cc: Stephen Leake <stephen_leake@stephe-leake.org>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > Stefan Monnier <monnier@iro.umontreal.ca>, > emacs-devel@gnu.org > > > We don't need to know. The Lisp program which needs to handle this > > situation will have to figure out what is right in that case, "right" > > in the sense that it produces the desired results after communicating > > the changes to TS. > > The difficulty is that what tree-sitter sees must be consistent. If Emacs updates tree-sitter with option 1 and lisp later choose option 2, the content that tree-sitter sees is not consistent. If that happens, it means the Lisp program which does that has a bug that needs to be fixed. > Anyway, I found a way that avoids this issue: the bounds of tree-sitter’s visible region never changes, and the next time when lisp narrows to a different region, we update tree-sitter’s bound to match that of the narrowing. Here is the latest patch. If the code is not entirely straightforward, I’m happy to add more comment to explain it. I'm not sure we should do this, because it means we second-guess what the Lisp program calling TS intends to do. Why should we do that, instead of leaving it to the Lisp program to DTRT? And what happens if our guess is wrong? ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-29 15:28 ` Eli Zaretskii @ 2021-07-29 15:57 ` Yuan Fu 2021-07-29 16:21 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-07-29 15:57 UTC (permalink / raw) To: Eli Zaretskii Cc: Clément Pit-Claudel, Stephen Leake, monnier, emacs-devel > On Jul 29, 2021, at 11:28 AM, Eli Zaretskii <eliz@gnu.org> wrote: > >> From: Yuan Fu <casouri@gmail.com> >> Date: Thu, 29 Jul 2021 10:35:10 -0400 >> Cc: Stephen Leake <stephen_leake@stephe-leake.org>, >> Clément Pit-Claudel <cpitclaudel@gmail.com>, >> Stefan Monnier <monnier@iro.umontreal.ca>, >> emacs-devel@gnu.org >> >>> We don't need to know. The Lisp program which needs to handle this >>> situation will have to figure out what is right in that case, "right" >>> in the sense that it produces the desired results after communicating >>> the changes to TS. >> >> The difficulty is that what tree-sitter sees must be consistent. If Emacs updates tree-sitter with option 1 and lisp later choose option 2, the content that tree-sitter sees is not consistent. > > If that happens, it means the Lisp program which does that has a bug > that needs to be fixed. > >> Anyway, I found a way that avoids this issue: the bounds of tree-sitter’s visible region never changes, and the next time when lisp narrows to a different region, we update tree-sitter’s bound to match that of the narrowing. Here is the latest patch. If the code is not entirely straightforward, I’m happy to add more comment to explain it. > > I'm not sure we should do this, because it means we second-guess what > the Lisp program calling TS intends to do. Why should we do that, > instead of leaving it to the Lisp program to DTRT? And what happens > if our guess is wrong? I don’t think the current implementation guesses anything. 
Let me turn around and ask you what TRT is: if the buffer is xxxAAAxxx, and Lisp narrows to AAA and creates a parser, the parser sees AAA. Now we widen and the user inserts BBB in front of AAA: what do we tell tree-sitter? That nothing changed, or that BBB was inserted at the beginning? And to where should Lisp narrow next: BBBAAA, AAA, or BBB? Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
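To make the scenario above concrete, here is a minimal sketch of one possible answer: the parser keeps a fixed visible range, and every buffer edit is clipped to that range before it is reported. The names `clamp_pos`, `clip_edit` and `struct clipped_edit` are hypothetical, positions are assumed to be 1-based byte positions, and clamping both ends of the edit is a simplification chosen for this illustration; it is not claimed to be what the patch does in every case.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the byte fields of an edit reported to
   the parser, as offsets from the start of the visible range.  */
struct clipped_edit
{
  ptrdiff_t start;
  ptrdiff_t old_end;
  ptrdiff_t new_end;
};

/* Clamp POS into [LO, HI].  */
ptrdiff_t
clamp_pos (ptrdiff_t pos, ptrdiff_t lo, ptrdiff_t hi)
{
  if (pos < lo)
    return lo;
  if (pos > hi)
    return hi;
  return pos;
}

/* Clip a buffer edit to the visible range [VISIBLE_BEG, VISIBLE_END]
   and rebase it so that VISIBLE_BEG becomes offset 0.  This is an
   assumption of the sketch: text outside the visible range is
   treated as if it did not exist.  */
struct clipped_edit
clip_edit (ptrdiff_t visible_beg, ptrdiff_t visible_end,
           ptrdiff_t start_byte, ptrdiff_t old_end_byte,
           ptrdiff_t new_end_byte)
{
  assert (visible_beg <= visible_end);
  assert (start_byte <= old_end_byte);
  assert (start_byte <= new_end_byte);

  struct clipped_edit e;
  e.start = clamp_pos (start_byte, visible_beg, visible_end) - visible_beg;
  e.old_end = clamp_pos (old_end_byte, visible_beg, visible_end) - visible_beg;
  e.new_end = clamp_pos (new_end_byte, visible_beg, visible_end) - visible_beg;
  return e;
}
```

With xxxAAAxxx narrowed to AAA, the visible range is bytes 4 to 7. Inserting BBB at position 4 is `clip_edit (4, 7, 4, 4, 7)`, which the parser would see as three bytes inserted at offset 0. An insertion that lies entirely before position 4 clips to an empty edit at offset 0. Whether either outcome is the right thing is exactly what the thread is debating.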
* Re: How to add pseudo vector types 2021-07-29 15:57 ` Yuan Fu @ 2021-07-29 16:21 ` Eli Zaretskii 2021-07-29 16:59 ` Yuan Fu 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-07-29 16:21 UTC (permalink / raw) To: Yuan Fu; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel > From: Yuan Fu <casouri@gmail.com> > Date: Thu, 29 Jul 2021 11:57:56 -0400 > Cc: Stephen Leake <stephen_leake@stephe-leake.org>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > monnier@iro.umontreal.ca, > emacs-devel@gnu.org > > >> Anyway, I found a way that avoids this issue: the bounds of tree-sitter’s visible region never changes, and the next time when lisp narrows to a different region, we update tree-sitter’s bound to match that of the narrowing. Here is the latest patch. If the code is not entirely straightforward, I’m happy to add more comment to explain it. > > > > I'm not sure we should do this, because it means we second-guess what > > the Lisp program calling TS intends to do. Why should we do that, > > instead of leaving it to the Lisp program to DTRT? And what happens > > if our guess is wrong? > > I don’t think the current implementation guesses anything. Let me turn around and ask you what is TRT: if the buffer is xxxAAAxxx, and lisp narrows to AAA and creates a parser, parser sees AAA; now widen, user inserts BBB in front of AAA, what do we tell tree-sitter? Nothing changed, or BBB inserted at the beginning? Neither. We should tell TS that instead of AAA there's now xxxBBBAAAxxx, because the narrowing was removed. > To where should lisp narrow? BBBAAA, or AAA, or BBB? It's the question for the Lisp program, not for the low-level code which we are discussing. Anyway, you are once again bothered by a scenario that should not happen at all: a Lisp program should not call TS first with, then without narrowing (or the other way around). 
I don't see why such situations should happen, and if they do, the Lisp programs which need them will have to figure out what to do and how. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-29 16:21 ` Eli Zaretskii @ 2021-07-29 16:59 ` Yuan Fu 2021-07-29 17:38 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-07-29 16:59 UTC (permalink / raw) To: Eli Zaretskii Cc: Clément Pit-Claudel, Stephen Leake, Stefan Monnier, emacs-devel > On Jul 29, 2021, at 12:21 PM, Eli Zaretskii <eliz@gnu.org> wrote: > >> From: Yuan Fu <casouri@gmail.com> >> Date: Thu, 29 Jul 2021 11:57:56 -0400 >> Cc: Stephen Leake <stephen_leake@stephe-leake.org>, >> Clément Pit-Claudel <cpitclaudel@gmail.com>, >> monnier@iro.umontreal.ca, >> emacs-devel@gnu.org >> >>>> Anyway, I found a way that avoids this issue: the bounds of tree-sitter’s visible region never changes, and the next time when lisp narrows to a different region, we update tree-sitter’s bound to match that of the narrowing. Here is the latest patch. If the code is not entirely straightforward, I’m happy to add more comment to explain it. >>> >>> I'm not sure we should do this, because it means we second-guess what >>> the Lisp program calling TS intends to do. Why should we do that, >>> instead of leaving it to the Lisp program to DTRT? And what happens >>> if our guess is wrong? >> >> I don’t think the current implementation guesses anything. Let me turn around and ask you what is TRT: if the buffer is xxxAAAxxx, and lisp narrows to AAA and creates a parser, parser sees AAA; now widen, user inserts BBB in front of AAA, what do we tell tree-sitter? Nothing changed, or BBB inserted at the beginning? > > Neither. We should tell TS that instead of AAA there's now > xxxBBBAAAxxx, because the narrowing was removed. This is the common usage that I imagined: (1) narrow, (2) call tree-sitter (for fontification etc.), (3) widen, (4) the user edits the buffer, (5) narrow, (6) call tree-sitter (for fontification etc.), (7) widen. Ideally, tree-sitter only sees the narrowed region because every time it is called, the buffer is narrowed. 
However, tree-sitter doesn’t work that way: it needs to be updated when the user edits the buffer, which happens while the buffer is widened. If your goal is to give Lisp control of what tree-sitter sees, we can’t just give tree-sitter the whole buffer whenever the user makes some change. > >> To where should lisp narrow? BBBAAA, or AAA, or BBB? > > It's the question for the Lisp program, not for the low-level code > which we are discussing. > > Anyway, you are once again bothered by a scenario that should not > happen at all: a Lisp program should not call TS first with, then > without narrowing (or the other way around). I don't see why such > situation should happen, and if they do, the Lisp programs which need > them will have to figure out what to do and how. Even if Lisp always calls tree-sitter with narrowing, we still need to update tree-sitter when the buffer is widened. Does that make sense? Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-29 16:59 ` Yuan Fu @ 2021-07-29 17:38 ` Eli Zaretskii 2021-07-29 17:55 ` Yuan Fu 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-07-29 17:38 UTC (permalink / raw) To: Yuan Fu; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel > From: Yuan Fu <casouri@gmail.com> > Date: Thu, 29 Jul 2021 12:59:43 -0400 > Cc: Stephen Leake <stephen_leake@stephe-leake.org>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > Stefan Monnier <monnier@iro.umontreal.ca>, > emacs-devel@gnu.org > > >> I don’t think the current implementation guesses anything. Let me turn around and ask you what is TRT: if the buffer is xxxAAAxxx, and lisp narrows to AAA and creates a parser, parser sees AAA; now widen, user inserts BBB in front of AAA, what do we tell tree-sitter? Nothing changed, or BBB inserted at the beginning? > > > > Neither. We should tell TS that instead of AAA there's now > > xxxBBBAAAxxx, because the narrowing was removed. > > This is the common usage that I imagined: > > Narrow > Calls tree-sitter (for fontification etc) > Widen > > Users edit the buffer > > narrow > Calls tree-sitter (for fontification etc) > Widen > > Ideally, tree-sitter only sees the narrowed region because everytime it is called, the buffer is narrowed. However, tree-sitter doesn’t work that way, it needs to be updated when user edits the buffer, when the buffer is widened. If your goal is give lisp control of what tree-sitter sees, we can’t just give tree-sitter the whole buffer whenever the user makes some change. In the above scenario, then the Lisp program that narrows the buffer should figure out how to do that correctly. The call to TS will then express the changes in the narrowed region only. > > Anyway, you are once again bothered by a scenario that should not > > happen at all: a Lisp program should not call TS first with, then > > without narrowing (or the other way around). 
I don't see why such > > situation should happen, and if they do, the Lisp programs which need > > them will have to figure out what to do and how. > > Even if lisp always call tree-sitter with narrowing, we still need to update tree-sitter when the buffer is widened. No, I don't think so. Why would we need to? From the TS POV the text outside the restriction doesn't exist because it never sees it. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-29 17:38 ` Eli Zaretskii @ 2021-07-29 17:55 ` Yuan Fu 2021-07-29 18:37 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-07-29 17:55 UTC (permalink / raw) To: Eli Zaretskii; +Cc: cpitclaudel, Stephen Leake, monnier, emacs-devel >> >> Ideally, tree-sitter only sees the narrowed region because everytime it is called, the buffer is narrowed. However, tree-sitter doesn’t work that way, it needs to be updated when user edits the buffer, when the buffer is widened. If your goal is give lisp control of what tree-sitter sees, we can’t just give tree-sitter the whole buffer whenever the user makes some change. > > In the above scenario, then the Lisp program that narrows the buffer > should figure out how to do that correctly. The call to TS will then > express the changes in the narrowed region only. > >>> Anyway, you are once again bothered by a scenario that should not >>> happen at all: a Lisp program should not call TS first with, then >>> without narrowing (or the other way around). I don't see why such >>> situation should happen, and if they do, the Lisp programs which need >>> them will have to figure out what to do and how. >> >> Even if lisp always call tree-sitter with narrowing, we still need to update tree-sitter when the buffer is widened. > > No, I don't think so. Why would we need to? From the TS POV the text > outside the restriction doesn't exist because it never sees it. Actually, that sounds like how it works in my code right now. After the last few exchanges, I still have the feeling that we are not on the same page. Could you have a look at the code in ts_ensure_parsed and ts_record_change, and see if it aligns with what you consider to be the right thing? If you have read them already and think you understand what they are doing, could you tell me how exactly these two functions should behave, in your opinion? Thanks. 
Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
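For readers who want the gist of ts_ensure_parsed without opening the patch, the range-moving part can be sketched as below. This follows the order of the numbered steps in the attached patch, but it records the edits in an array instead of calling `ts_tree_edit`. `struct ts_edit` and `sync_visible_range` are hypothetical names used only for this illustration, and the sketch leaves out everything else the real function does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one edit reported to the parser, expressed
   as offsets from the start of the parser's visible range.  */
struct ts_edit
{
  ptrdiff_t start;
  ptrdiff_t old_end;
  ptrdiff_t new_end;
};

/* Move the visible range [*VISIBLE_BEG, *VISIBLE_END] so that it
   matches the narrowing [BEGV, ZV].  Store the edits the parser would
   be told about in EDITS (room for at least 3) and return how many
   were stored.  */
int
sync_visible_range (ptrdiff_t *visible_beg, ptrdiff_t *visible_end,
                    ptrdiff_t begv, ptrdiff_t zv, struct ts_edit *edits)
{
  assert (*visible_beg <= *visible_end);
  assert (begv <= zv);

  int n = 0;
  ptrdiff_t beg = *visible_beg;
  ptrdiff_t end = *visible_end;

  /* 1. Extend the front: the parser sees an insertion at offset 0.  */
  if (beg > begv)
    {
      edits[n++] = (struct ts_edit) {0, 0, beg - begv};
      beg = begv;
    }
  /* 2. Make the end match: an insertion or a deletion at the end.  */
  if (end < zv)
    {
      edits[n++] = (struct ts_edit) {end - beg, end - beg, zv - beg};
      end = zv;
    }
  else if (end > zv)
    {
      edits[n++] = (struct ts_edit) {zv - beg, end - beg, zv - beg};
      end = zv;
    }
  /* 3. Shrink the front: the parser sees a deletion at offset 0.  */
  if (beg < begv)
    {
      edits[n++] = (struct ts_edit) {0, begv - beg, 0};
      beg = begv;
    }

  *visible_beg = beg;
  *visible_end = end;
  return n;
}
```

The order matters: the front is extended first so that the offsets used for the end adjustment are never negative, and the front is shrunk last so that the deletion at offset 0 never reaches past the text the parser already knows about.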
* Re: How to add pseudo vector types 2021-07-29 17:55 ` Yuan Fu @ 2021-07-29 18:37 ` Eli Zaretskii 2021-07-29 18:57 ` Yuan Fu 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-07-29 18:37 UTC (permalink / raw) To: Yuan Fu; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel > From: Yuan Fu <casouri@gmail.com> > Date: Thu, 29 Jul 2021 13:55:48 -0400 > Cc: Stephen Leake <stephen_leake@stephe-leake.org>, > cpitclaudel@gmail.com, > monnier@iro.umontreal.ca, > emacs-devel@gnu.org > > Actually, that sounds like how it works in my code right now. After the last few exchanges, I still have the feeling that we are not on the same page. Could you have a look at the code in ts_ensure_parsed and ts_record_change, and see if it aligns with what you consider to be the right thing? If you have read them already and think you understand what are they doing, could you tell me how exactly should these two functions behave, in your opinion? Thanks. Where do I find the latest version of the code? ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-29 18:37 ` Eli Zaretskii @ 2021-07-29 18:57 ` Yuan Fu 2021-07-30 6:47 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-07-29 18:57 UTC (permalink / raw) To: Eli Zaretskii Cc: Clément Pit-Claudel, Stephen Leake, Stefan Monnier, emacs-devel [-- Attachment #1: Type: text/plain, Size: 925 bytes --] > On Jul 29, 2021, at 2:37 PM, Eli Zaretskii <eliz@gnu.org> wrote: > >> From: Yuan Fu <casouri@gmail.com> >> Date: Thu, 29 Jul 2021 13:55:48 -0400 >> Cc: Stephen Leake <stephen_leake@stephe-leake.org>, >> cpitclaudel@gmail.com, >> monnier@iro.umontreal.ca, >> emacs-devel@gnu.org >> >> Actually, that sounds like how it works in my code right now. After the last few exchanges, I still have the feeling that we are not on the same page. Could you have a look at the code in ts_ensure_parsed and ts_record_change, and see if it aligns with what you consider to be the right thing? If you have read them already and think you understand what are they doing, could you tell me how exactly should these two functions behave, in your opinion? Thanks. > > Where do I find the latest version of the code? A few messages back I attached a patch, ts.5.patch. Actually I can just attach it again, here. 
Yuan [-- Attachment #2: ts.5.patch --] [-- Type: application/octet-stream, Size: 23721 bytes --] From 62fc019a7f57119329d53b9b8a3e8b5c1e61b27f Mon Sep 17 00:00:00 2001 From: Yuan Fu <casouri@gmail.com> Date: Wed, 28 Jul 2021 21:08:43 -0400 Subject: [PATCH] checkpoint 5 - Move define_error out of json.c - Add narrowing support --- lisp/tree-sitter.el | 11 +- src/eval.c | 13 ++ src/json.c | 16 --- src/lisp.h | 5 + src/tree_sitter.c | 231 +++++++++++++++++++++++----------- src/tree_sitter.h | 15 ++- test/src/tree-sitter-tests.el | 53 ++++++++ 7 files changed, 251 insertions(+), 93 deletions(-) diff --git a/lisp/tree-sitter.el b/lisp/tree-sitter.el index a6ecb09386..8a887bb406 100644 --- a/lisp/tree-sitter.el +++ b/lisp/tree-sitter.el @@ -102,12 +102,13 @@ tree-sitter-font-lock-settings PATTERN is a tree-sitter query pattern. (See manual for how to write query patterns.) This pattern should capture nodes with -either face names or function names. If captured with a face -name, the node's corresponding text in the buffer is fontified -with that face; if captured with a function name, the function is -called with three arguments, BEG END NODE, where BEG and END +either face symbols or function symbols. If captured with a face +symbol, the node's corresponding text in the buffer is fontified +with that face; if captured with a function symbol, the function +is called with three arguments, BEG END NODE, where BEG and END marks the span of the corresponding text, and NODE is the node -itself.") +itself. If a symbol is both a face and a function, it is treated +as a face.") (defun tree-sitter-fontify-region-function (beg end &optional verbose) "Fontify the region between BEG and END. 
diff --git a/src/eval.c b/src/eval.c index 18faa0b9b1..33c0763f38 100644 --- a/src/eval.c +++ b/src/eval.c @@ -1956,6 +1956,19 @@ signal_error (const char *s, Lisp_Object arg) xsignal (Qerror, Fcons (build_string (s), arg)); } +void +define_error (Lisp_Object name, const char *message, Lisp_Object parent) +{ + eassert (SYMBOLP (name)); + eassert (SYMBOLP (parent)); + Lisp_Object parent_conditions = Fget (parent, Qerror_conditions); + eassert (CONSP (parent_conditions)); + eassert (!NILP (Fmemq (parent, parent_conditions))); + eassert (NILP (Fmemq (name, parent_conditions))); + Fput (name, Qerror_conditions, pure_cons (name, parent_conditions)); + Fput (name, Qerror_message, build_pure_c_string (message)); +} + /* Use this for arithmetic overflow, e.g., when an integer result is too large even for a bignum. */ void diff --git a/src/json.c b/src/json.c index 3f1d27ad7f..ff28143a3c 100644 --- a/src/json.c +++ b/src/json.c @@ -1098,22 +1098,6 @@ DEFUN ("json-parse-buffer", Fjson_parse_buffer, Sjson_parse_buffer, return unbind_to (count, lisp); } -/* Simplified version of 'define-error' that works with pure - objects. */ - -static void -define_error (Lisp_Object name, const char *message, Lisp_Object parent) -{ - eassert (SYMBOLP (name)); - eassert (SYMBOLP (parent)); - Lisp_Object parent_conditions = Fget (parent, Qerror_conditions); - eassert (CONSP (parent_conditions)); - eassert (!NILP (Fmemq (parent, parent_conditions))); - eassert (NILP (Fmemq (name, parent_conditions))); - Fput (name, Qerror_conditions, pure_cons (name, parent_conditions)); - Fput (name, Qerror_message, build_pure_c_string (message)); -} - void syms_of_json (void) { diff --git a/src/lisp.h b/src/lisp.h index e439447283..d30509b61a 100644 --- a/src/lisp.h +++ b/src/lisp.h @@ -5127,6 +5127,11 @@ maybe_gc (void) maybe_garbage_collect (); } +/* Simplified version of 'define-error' that works with pure + objects. 
*/ +void +define_error (Lisp_Object name, const char *message, Lisp_Object parent); + INLINE_HEADER_END #endif /* EMACS_LISP_H */ diff --git a/src/tree_sitter.c b/src/tree_sitter.c index e9f8ddc7e3..5e16df7758 100644 --- a/src/tree_sitter.c +++ b/src/tree_sitter.c @@ -19,17 +19,8 @@ Copyright (C) 2021 Free Software Foundation, Inc. #include <config.h> -#include <sys/types.h> -#include <sys/stat.h> -#include <sys/param.h> -#include <errno.h> -#include <stdio.h> -#include <stdlib.h> -#include <unistd.h> - #include "lisp.h" #include "buffer.h" -#include "coding.h" #include "tree_sitter.h" /* parser.h defines a macro ADVANCE that conflicts with alloc.c. */ @@ -61,6 +52,16 @@ DEFUN ("tree-sitter-node-p", /*** Parsing functions */ +static inline void +ts_tree_edit_1 (TSTree *tree, ptrdiff_t start_byte, + ptrdiff_t old_end_byte, ptrdiff_t new_end_byte) +{ + TSPoint dummy_point = {0, 0}; + TSInputEdit edit = {start_byte, old_end_byte, new_end_byte, + dummy_point, dummy_point, dummy_point}; + ts_tree_edit (tree, &edit); +} + /* Update each parser's tree after the user made an edit. This function does not parse the buffer and only updates the tree. (So it should be very fast.) */ @@ -68,18 +69,38 @@ DEFUN ("tree-sitter-node-p", ts_record_change (ptrdiff_t start_byte, ptrdiff_t old_end_byte, ptrdiff_t new_end_byte) { + eassert(start_byte <= old_end_byte); + eassert(start_byte <= new_end_byte); + Lisp_Object parser_list = Fsymbol_value (Qtree_sitter_parser_list); - TSPoint dummy_point = {0, 0}; - TSInputEdit edit = {start_byte, old_end_byte, new_end_byte, - dummy_point, dummy_point, dummy_point}; + while (!NILP (parser_list)) { Lisp_Object lisp_parser = Fcar (parser_list); TSTree *tree = XTS_PARSER (lisp_parser)->tree; if (tree != NULL) - ts_tree_edit (tree, &edit); - XTS_PARSER (lisp_parser)->need_reparse = true; - parser_list = Fcdr (parser_list); + { + /* We "clip" the change to between visible_beg and + visible_end. 
It is okay if visible_end ends up larger + than BUF_Z, tree-sitter only access buffer text during + re-parse, and we will adjust visible_beg/end before + re-parse. */ + ptrdiff_t visible_beg = XTS_PARSER (lisp_parser)->visible_beg; + ptrdiff_t visible_end = XTS_PARSER (lisp_parser)->visible_end; + + ptrdiff_t visible_start = + max (visible_beg, start_byte) - visible_beg; + ptrdiff_t visible_old_end = + min (visible_end, old_end_byte) - visible_beg; + ptrdiff_t visible_new_end = + min (visible_end, new_end_byte) - visible_beg; + + ts_tree_edit_1 (tree, visible_start, visible_old_end, + visible_new_end); + XTS_PARSER (lisp_parser)->need_reparse = true; + + parser_list = Fcdr (parser_list); + } } } @@ -93,16 +114,67 @@ ts_ensure_parsed (Lisp_Object parser) TSParser *ts_parser = XTS_PARSER (parser)->parser; TSTree *tree = XTS_PARSER(parser)->tree; TSInput input = XTS_PARSER (parser)->input; + struct buffer *buffer = XTS_PARSER (parser)->buffer; + + /* Before we parse, catch up with the narrowing situation. We + change visible_beg and visible_end to match BUF_BEGV_BYTE and + BUF_ZV_BYTE, and inform tree-sitter of the change. */ + ptrdiff_t visible_beg = XTS_PARSER (parser)->visible_beg; + ptrdiff_t visible_end = XTS_PARSER (parser)->visible_end; + /* Before re-parse, we want to move the visible range of tree-sitter + to matched the narrowed range. For example: + Move ________|____|__ + to |____|__________ */ + + /* 1. Make sure visible_beg <= BUF_BEGV_BYTE. */ + if (visible_beg > BUF_BEGV_BYTE (buffer)) + { + /* Tree-sitter sees: insert at the beginning. */ + ts_tree_edit_1 (tree, 0, 0, visible_beg - BUF_BEGV_BYTE (buffer)); + visible_beg = BUF_BEGV_BYTE (buffer); + } + /* 2. Make sure visible_end = BUF_ZV_BYTE. */ + if (visible_end < BUF_ZV_BYTE (buffer)) + { + /* Tree-sitter sees: insert at the end. 
*/ + ts_tree_edit_1 (tree, visible_end - visible_beg, + visible_end - visible_beg, + BUF_ZV_BYTE (buffer) - visible_beg); + visible_end = BUF_ZV_BYTE (buffer); + } + else if (visible_end > BUF_ZV_BYTE (buffer)) + { + /* Tree-sitter sees: delete at the end. */ + ts_tree_edit_1 (tree, BUF_ZV_BYTE (buffer) - visible_beg, + visible_end - visible_beg, + BUF_ZV_BYTE (buffer) - visible_beg); + visible_end = BUF_ZV_BYTE (buffer); + } + /* 3. Make sure visible_beg = BUF_BEGV_BYTE. */ + if (visible_beg < BUF_BEGV_BYTE (buffer)) + { + /* Tree-sitter sees: delete at the beginning. */ + ts_tree_edit_1 (tree, 0, BUF_BEGV_BYTE (buffer) - visible_beg, 0); + visible_beg = BUF_BEGV_BYTE (buffer); + } + XTS_PARSER (parser)->visible_beg = visible_beg; + XTS_PARSER (parser)->visible_end = visible_end; + TSTree *new_tree = ts_parser_parse(ts_parser, tree, input); - /* This should be very rare: it only happens when 1) language is not - set (impossible in Emacs because the user has to supply a - language to create a parser), 2) parse canceled due to timeout - (impossible because we don't set a timeout), 3) parse canceled - due to cancellation flag (impossible because we don't set the - flag). (See comments for ts_parser_parse in + /* This should be very rare (impossible, really): it only happens + when 1) language is not set (impossible in Emacs because the user + has to supply a language to create a parser), 2) parse canceled + due to timeout (impossible because we don't set a timeout), 3) + parse canceled due to cancellation flag (impossible because we + don't set the flag). (See comments for ts_parser_parse in tree_sitter/api.h.) 
*/ if (new_tree == NULL) - signal_error ("Parse failed", parser); + { + Lisp_Object buf; + XSETBUFFER(buf, buffer); + xsignal1 (Qtree_sitter_parse_error, buf); + } + ts_tree_delete (tree); XTS_PARSER (parser)->tree = new_tree; XTS_PARSER (parser)->need_reparse = false; @@ -110,13 +182,18 @@ ts_ensure_parsed (Lisp_Object parser) } /* This is the read function provided to tree-sitter to read from a - buffer. It reads one character at a time and automatically skip + buffer. It reads one character at a time and automatically skips the gap. */ const char* -ts_read_buffer (void *buffer, uint32_t byte_index, +ts_read_buffer (void *parser, uint32_t byte_index, TSPoint position, uint32_t *bytes_read) { - ptrdiff_t byte_pos = byte_index + 1; + struct buffer *buffer = ((struct Lisp_TS_Parser *) parser)->buffer; + ptrdiff_t visible_beg = ((struct Lisp_TS_Parser *) parser)->visible_beg; + ptrdiff_t byte_pos = byte_index + visible_beg; + /* We will make sure visible_beg >= BUF_BEG_BYTE before re-parse (in + ts_ensure_parsed), so byte_pos will never be smaller than + BUF_BEG_BYTE (unless byte_index < 0). */ /* Read one character. Tree-sitter wants us to set bytes_read to 0 if it reads to the end of buffer. It doesn't say what it wants @@ -126,26 +203,26 @@ ts_read_buffer (void *buffer, uint32_t byte_index, int len; /* This function could run from a user command, so it is better to do nothing instead of raising an error. (It was a pain in the a** - to read mega-if-conditions in Emacs source, so I write the two - branches separately, hoping the compiler can merge them.) */ - if (!BUFFER_LIVE_P ((struct buffer *) buffer)) + to decrypt mega-if-conditions in Emacs source, so I wrote the two + branches separately.) */ + if (!BUFFER_LIVE_P (buffer)) { beg = ""; len = 0; } - // TODO BUF_ZV_BYTE? - else if (byte_pos >= BUF_Z_BYTE ((struct buffer *) buffer)) + /* Reached visible end-of-buffer, tell tree-sitter to read no more. 
*/ + else if (byte_pos >= BUF_ZV_BYTE (buffer)) { beg = ""; len = 0; } + /* Normal case, read a character. */ else { beg = (char *) BUF_BYTE_ADDRESS (buffer, byte_pos); - len = BYTES_BY_CHAR_HEAD ((int) beg); + len = BYTES_BY_CHAR_HEAD ((int) *beg); } *bytes_read = (uint32_t) len; - return beg; } @@ -158,13 +235,16 @@ make_ts_parser (struct buffer *buffer, TSParser *parser, { struct Lisp_TS_Parser *lisp_parser = ALLOCATE_PSEUDOVECTOR (struct Lisp_TS_Parser, name, PVEC_TS_PARSER); + lisp_parser->name = name; lisp_parser->buffer = buffer; lisp_parser->parser = parser; lisp_parser->tree = tree; - TSInput input = {buffer, ts_read_buffer, TSInputEncodingUTF8}; + TSInput input = {lisp_parser, ts_read_buffer, TSInputEncodingUTF8}; lisp_parser->input = input; lisp_parser->need_reparse = true; + lisp_parser->visible_beg = BUF_BEGV_BYTE (buffer); + lisp_parser->visible_end = BUF_ZV_BYTE (buffer); return make_lisp_ptr (lisp_parser, Lisp_Vectorlike); } @@ -287,7 +367,7 @@ DEFUN ("tree-sitter-parse-string", /* See comment in ts_ensure_parsed for possible reasons for a failure. 
*/ if (tree == NULL) - signal_error ("Failed to parse STRING", string); + xsignal1 (Qtree_sitter_parse_error, string); TSNode root_node = ts_tree_root_node (tree); @@ -535,7 +615,9 @@ DEFUN ("tree-sitter-node-first-child-for-byte", { CHECK_INTEGER (pos); - struct buffer *buf = (XTS_PARSER (XTS_NODE (node)->parser)->buffer); + struct buffer *buf = XTS_PARSER (XTS_NODE (node)->parser)->buffer; + ptrdiff_t visible_beg = + XTS_PARSER (XTS_NODE (node)->parser)->visible_beg; ptrdiff_t byte_pos = XFIXNUM (pos); if (byte_pos < BUF_BEGV_BYTE (buf) || byte_pos > BUF_ZV_BYTE (buf)) @@ -544,9 +626,10 @@ DEFUN ("tree-sitter-node-first-child-for-byte", TSNode ts_node = XTS_NODE (node)->node; TSNode child; if (NILP (named)) - child = ts_node_first_child_for_byte (ts_node, byte_pos - 1); + child = ts_node_first_child_for_byte (ts_node, byte_pos - visible_beg); else - child = ts_node_first_named_child_for_byte (ts_node, byte_pos - 1); + child = ts_node_first_named_child_for_byte + (ts_node, byte_pos - visible_beg); if (ts_node_is_null(child)) return Qnil; @@ -566,7 +649,9 @@ DEFUN ("tree-sitter-node-descendant-for-byte-range", CHECK_INTEGER (beg); CHECK_INTEGER (end); - struct buffer *buf = (XTS_PARSER (XTS_NODE (node)->parser)->buffer); + struct buffer *buf = XTS_PARSER (XTS_NODE (node)->parser)->buffer; + ptrdiff_t visible_beg = + XTS_PARSER (XTS_NODE (node)->parser)->visible_beg; ptrdiff_t byte_beg = XFIXNUM (beg); ptrdiff_t byte_end = XFIXNUM (end); @@ -580,10 +665,10 @@ DEFUN ("tree-sitter-node-descendant-for-byte-range", TSNode child; if (NILP (named)) child = ts_node_descendant_for_byte_range - (ts_node, byte_beg - 1 , byte_end - 1); + (ts_node, byte_beg - visible_beg , byte_end - visible_beg); else child = ts_node_named_descendant_for_byte_range - (ts_node, byte_beg - 1, byte_end - 1); + (ts_node, byte_beg - visible_beg, byte_end - visible_beg); if (ts_node_is_null(child)) return Qnil; @@ -593,31 +678,24 @@ DEFUN ("tree-sitter-node-descendant-for-byte-range", /* Query 
functions */ -Lisp_Object ts_query_error_to_string (TSQueryError error) +char* +ts_query_error_to_string (TSQueryError error) { - char *error_name; switch (error) { case TSQueryErrorNone: - error_name = "none"; - break; + return "none"; case TSQueryErrorSyntax: - error_name = "syntax"; - break; + return "syntax"; case TSQueryErrorNodeType: - error_name = "node type"; - break; + return "node type"; case TSQueryErrorField: - error_name = "field"; - break; + return "field"; case TSQueryErrorCapture: - error_name = "capture"; - break; + return "capture"; case TSQueryErrorStructure: - error_name = "structure"; - break; + return "structure"; } - return make_pure_c_string (error_name, strlen(error_name)); } DEFUN ("tree-sitter-query-capture", @@ -634,7 +712,7 @@ DEFUN ("tree-sitter-query-capture", BEG and END, if _both_ non-nil, specifies the range in which the query is executed. -Return nil if the query failed. */) +Raise an tree-sitter-query-error if PATTERN is malformed. */) (Lisp_Object node, Lisp_Object pattern, Lisp_Object beg, Lisp_Object end) { @@ -643,47 +721,56 @@ DEFUN ("tree-sitter-query-capture", TSNode ts_node = XTS_NODE (node)->node; Lisp_Object lisp_parser = XTS_NODE (node)->parser; + ptrdiff_t visible_beg = + XTS_PARSER (XTS_NODE (node)->parser)->visible_beg; const TSLanguage *lang = ts_parser_language (XTS_PARSER (lisp_parser)->parser); char *source = SSDATA (pattern); + uint32_t error_offset; - uint32_t error_type; + TSQueryError error_type; TSQuery *query = ts_query_new (lang, source, strlen (source), &error_offset, &error_type); TSQueryCursor *cursor = ts_query_cursor_new (); if (query == NULL) { - // FIXME: Signal an error? - return Qnil; + // FIXME: Still crashes, debug when I can get a gdb. 
+ xsignal2 (Qtree_sitter_query_error, + make_fixnum (error_offset), + build_string (ts_query_error_to_string (error_type))); } if (!NILP (beg) && !NILP (end)) { EMACS_INT beg_byte = XFIXNUM (beg); EMACS_INT end_byte = XFIXNUM (end); ts_query_cursor_set_byte_range - (cursor, (uint32_t) beg_byte - 1, (uint32_t) end_byte - 1); + (cursor, (uint32_t) beg_byte - visible_beg, + (uint32_t) end_byte - visible_beg); } ts_query_cursor_exec (cursor, query, ts_node); TSQueryMatch match; - TSQueryCapture capture; + Lisp_Object result = Qnil; - Lisp_Object entry; - Lisp_Object captured_node; - const char *capture_name; - uint32_t capture_name_len; while (ts_query_cursor_next_match (cursor, &match)) { const TSQueryCapture *captures = match.captures; for (int idx=0; idx < match.capture_count; idx++) { + TSQueryCapture capture; + Lisp_Object captured_node; + const char *capture_name; + Lisp_Object entry; + uint32_t capture_name_len; + capture = captures[idx]; captured_node = make_ts_node(lisp_parser, capture.node); capture_name = ts_query_capture_name_for_id (query, capture.index, &capture_name_len); - entry = Fcons (intern_c_string (capture_name), + entry = Fcons (intern_c_string_1 + (capture_name, capture_name_len), captured_node); result = Fcons (entry, result); } @@ -705,11 +792,15 @@ syms_of_tree_sitter (void) DEFSYM (Qhas_changes, "has-changes"); DEFSYM (Qhas_error, "has-error"); + DEFSYM(Qtree_sitter_error, "tree-sitter-error"); DEFSYM (Qtree_sitter_query_error, "tree-sitter-query-error"); - Fput (Qtree_sitter_query_error, Qerror_conditions, - pure_list (Qtree_sitter_query_error, Qerror)); - Fput (Qtree_sitter_query_error, Qerror_message, - build_pure_c_string ("Error with query pattern")) + DEFSYM (Qtree_sitter_parse_error, "tree-sitter-parse-error") + define_error (Qtree_sitter_error, "Generic tree-sitter error", Qerror); + define_error (Qtree_sitter_query_error, "Query pattern is malformed", + Qtree_sitter_error); + define_error (Qtree_sitter_parse_error, "Parse failed", + 
Qtree_sitter_error); + DEFSYM (Qtree_sitter_parser_list, "tree-sitter-parser-list"); DEFVAR_LISP ("tree-sitter-parser-list", Vtree_sitter_parser_list, diff --git a/src/tree_sitter.h b/src/tree_sitter.h index e9b4a71326..7e0fec0ee9 100644 --- a/src/tree_sitter.h +++ b/src/tree_sitter.h @@ -20,8 +20,6 @@ Copyright (C) 2021 Free Software Foundation, Inc. #ifndef EMACS_TREE_SITTER_H #define EMACS_TREE_SITTER_H -#include <sys/types.h> - #include "lisp.h" #include <tree_sitter/api.h> @@ -33,12 +31,25 @@ #define EMACS_TREE_SITTER_H struct Lisp_TS_Parser { union vectorlike_header header; + /* A parser's name is just a convenient tag, see docstring for + 'tree-sitter-make-parser', and 'tree-sitter-get-parser'. */ Lisp_Object name; struct buffer *buffer; TSParser *parser; TSTree *tree; TSInput input; + /* Re-parsing an unchanged buffer is not free for tree-sitter, so we + only make it re-parse when need_reparse == true. That usually + means some change is made in the buffer. But others could set + this field to true to force tree-sitter to re-parse. */ bool need_reparse; + /* These two positions record the byte positions of the "visible + region" that tree-sitter sees. Unlike markers, these two + positions do not change as the user inserts and deletes text + around them. Before re-parse, we move these positions to match + BUF_BEGV_BYTE and BUF_ZV_BYTE. */ + ptrdiff_t visible_beg; + ptrdiff_t visible_end; }; /* A wrapper around a tree-sitter node. */ diff --git a/test/src/tree-sitter-tests.el b/test/src/tree-sitter-tests.el index c61ad678d2..69104568de 100644 --- a/test/src/tree-sitter-tests.el +++ b/test/src/tree-sitter-tests.el @@ -148,5 +148,58 @@ tree-sitter-query-api (cdr entry)))) (tree-sitter-query-capture root-node pattern))))))) +(ert-deftest tree-sitter-narrow () + "Tests if narrowing works." 
+ (with-temp-buffer + (let (parser root-node pattern doc-node object-node pair-node) + (progn + (insert "xxx[1,{\"name\": \"Bob\"},2,3]xxx") + (narrow-to-region (+ (point-min) 3) (- (point-max) 3)) + (setq parser (tree-sitter-create-parser + (current-buffer) (tree-sitter-json))) + (setq root-node (tree-sitter-parser-root-node + parser))) + ;; This test is from the basic test. + (should + (equal + (tree-sitter-node-string + (tree-sitter-parser-root-node parser)) + "(document (array (number) (object (pair key: (string (string_content)) value: (string (string_content)))) (number) (number)))")) + + (widen) + (goto-char (point-min)) + (insert "ooo") + (should (equal "oooxxx[1,{\"name\": \"Bob\"},2,3]xxx" + (buffer-string))) + (delete-region 10 26) + (should (equal "oooxxx[1,2,3]xxx" + (buffer-string))) + (narrow-to-region (+ (point-min) 6) (- (point-max) 3)) + ;; This test is also from the basic test. + (should + (equal (tree-sitter-node-string + (tree-sitter-parser-root-node parser)) + "(document (array (number) (number) (number)))")) + (widen) + (goto-char (point-max)) + (insert "[1,2]") + (should (equal "oooxxx[1,2,3]xxx[1,2]" + (buffer-string))) + (narrow-to-region (- (point-max) 5) (point-max)) + (should + (equal (tree-sitter-node-string + (tree-sitter-parser-root-node parser)) + "(document (array (number) (number)))")) + (widen) + (goto-char (point-min)) + (insert "[1]") + (should (equal "[1]oooxxx[1,2,3]xxx[1,2]" + (buffer-string))) + (narrow-to-region (point-min) (+ (point-min) 3)) + (should + (equal (tree-sitter-node-string + (tree-sitter-parser-root-node parser)) + "(document (array (number)))"))))) + (provide 'tree-sitter-tests) ;;; tree-sitter-tests.el ends here -- 2.24.3 (Apple Git-128) ^ permalink raw reply related [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-29 18:57 ` Yuan Fu @ 2021-07-30 6:47 ` Eli Zaretskii 2021-07-30 14:17 ` Yuan Fu 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-07-30 6:47 UTC (permalink / raw) To: Yuan Fu; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel > From: Yuan Fu <casouri@gmail.com> > Date: Thu, 29 Jul 2021 14:57:19 -0400 > Cc: Stephen Leake <stephen_leake@stephe-leake.org>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > Stefan Monnier <monnier@iro.umontreal.ca>, > emacs-devel@gnu.org > > >> Actually, that sounds like how it works in my code right now. After the last few exchanges, I still have the feeling that we are not on the same page. Could you have a look at the code in ts_ensure_parsed and ts_record_change, and see if it aligns with what you consider to be the right thing? If you have read them already and think you understand what are they doing, could you tell me how exactly should these two functions behave, in your opinion? Thanks. > > > > Where do I find the latest version of the code? > > A few messages back I attached a patch, ts.5.patch. Actually I can just attach it again, here. That's not the whole code, that's a patch against some previous version of the code. So I cannot answer your questions with 100% certainty, until I see the entire code of the TS support. For example, I'm not sure I have a clear idea when are the two functions ts_ensure_parsed and ts_record_change called. That said, it looks like the code is correct: you should record the changes in the entire buffer, but only pass to TS the changes inside the restriction BEGV..ZV that is in effect at the time of the re-parse call. Btw, I don't see the code that filters changes reported to TS by their positions against the restriction; did I miss something? And one more question: I understand that ts_read_buffer doesn't check against BUF_BEGV_BYTE because TS never reads before the "visible beg" position, is that right? 
But if so, why do we need the similar test for BUF_ZV_BYTE? Could TS attempt to read beyond the "visible end"? Thanks. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-07-30 6:47 ` Eli Zaretskii @ 2021-07-30 14:17 ` Yuan Fu 2021-08-03 10:24 ` Fu Yuan 2021-08-03 11:47 ` Eli Zaretskii 0 siblings, 2 replies; 370+ messages in thread From: Yuan Fu @ 2021-07-30 14:17 UTC (permalink / raw) To: Eli Zaretskii Cc: Clément Pit-Claudel, Stephen Leake, Stefan Monnier, emacs-devel > > That's not the whole code, that's a patch against some previous > version of the code. So I cannot answer your questions with 100% > certainty, until I see the entire code of the TS support. For > example, I'm not sure I have a clear idea when are the two functions > ts_ensure_parsed and ts_record_change called. Oops, I thought you had all prior patches. You can clone the “ts” branch from https://github.com/casouri/emacs.git If this is ok, I’ll push to this branch instead of sending patches from now on. > > That said, it looks like the code is correct: you should record the > changes in the entire buffer, but only pass to TS the changes inside > the restriction BEGV..ZV that is in effect at the time of the re-parse > call. Btw, I don't see the code that filters changes reported to TS > by their positions against the restriction; did I miss something? Yes, I do clip the change to the visible portion: ts_record_change (ptrdiff_t start_byte, ptrdiff_t old_end_byte, ptrdiff_t new_end_byte) { eassert(start_byte <= old_end_byte); eassert(start_byte <= new_end_byte); Lisp_Object parser_list = Fsymbol_value (Qtree_sitter_parser_list); while (!NILP (parser_list)) { Lisp_Object lisp_parser = Fcar (parser_list); TSTree *tree = XTS_PARSER (lisp_parser)->tree; if (tree != NULL) { /* We "clip" the change to between visible_beg and visible_end. It is okay if visible_end ends up larger than BUF_Z, tree-sitter only access buffer text during re-parse, and we will adjust visible_beg/end before re-parse. 
*/ ptrdiff_t visible_beg = XTS_PARSER (lisp_parser)->visible_beg; ptrdiff_t visible_end = XTS_PARSER (lisp_parser)->visible_end; ptrdiff_t visible_start = max (visible_beg, start_byte) - visible_beg; ptrdiff_t visible_old_end = min (visible_end, old_end_byte) - visible_beg; ptrdiff_t visible_new_end = min (visible_end, new_end_byte) - visible_beg; ts_tree_edit_1 (tree, visible_start, visible_old_end, visible_new_end); XTS_PARSER (lisp_parser)->need_reparse = true; } /* Advance even when this parser has no tree yet, otherwise the loop never terminates. */ parser_list = Fcdr (parser_list); } } > And one more question: I understand that ts_read_buffer doesn't check > against BUF_BEGV_BYTE because TS never reads before the "visible beg" > position, is that right? Yes, we always update visible_beg and visible_end to match BUF_BEGV_BYTE and BUF_ZV_BYTE before we instruct tree-sitter to re-parse. So when tree-sitter reads at byte position 0, it translates to buffer byte position 0 + visible_beg = BUF_BEGV_BYTE. > But if so, why do we need the similar test > for BUF_ZV_BYTE? Could TS attempt to read beyond the "visible end"? Tree-sitter doesn’t know the size of the buffer, it just keeps reading until the read function sets bytes_read to 0, signaling that it has reached the end. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
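The read protocol described above (offset every request by visible_beg, report zero bytes at the visible end) can be sketched outside Emacs. This is only an illustration under stated assumptions: `struct toy_parser`, `utf8_len` and `toy_read` are hypothetical names, the text is a flat array without a gap, the encoding is plain UTF-8, and the TSPoint argument of the real callback is left out. Byte positions are 1-based, as in Emacs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the parser object: the whole text plus
   the recorded visible range as 1-based byte positions.
   visible_end is one past the last visible byte.  */
struct toy_parser
{
  const char *text;
  ptrdiff_t visible_beg;
  ptrdiff_t visible_end;
};

/* Length in bytes of the UTF-8 sequence whose lead byte is C.  */
static int
utf8_len (unsigned char c)
{
  if (c < 0x80) return 1;
  if ((c & 0xE0) == 0xC0) return 2;
  if ((c & 0xF0) == 0xE0) return 3;
  if ((c & 0xF8) == 0xF0) return 4;
  return 1; /* Stray continuation byte: step over it.  */
}

/* Read one character at BYTE_INDEX, an offset relative to the
   visible beginning.  Setting *BYTES_READ to 0 is how the reader
   learns that the (visible) text has ended.  */
static const char *
toy_read (void *payload, uint32_t byte_index, uint32_t *bytes_read)
{
  struct toy_parser *p = payload;
  assert (p->visible_beg >= 1);
  ptrdiff_t byte_pos = (ptrdiff_t) byte_index + p->visible_beg;
  if (byte_pos >= p->visible_end)
    {
      *bytes_read = 0;
      return "";
    }
  const char *beg = p->text + (byte_pos - 1);
  *bytes_read = (uint32_t) utf8_len ((unsigned char) *beg);
  return beg;
}
```

With the text "xxx[1,2]xxx" and a visible range of 4..9, offset 0 yields "[" and offset 5 yields zero bytes, so the reader never sees the surrounding "xxx".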
* Re: How to add pseudo vector types 2021-07-30 14:17 ` Yuan Fu @ 2021-08-03 10:24 ` Fu Yuan 2021-08-03 11:42 ` Eli Zaretskii 2021-08-03 11:47 ` Eli Zaretskii 1 sibling, 1 reply; 370+ messages in thread From: Fu Yuan @ 2021-08-03 10:24 UTC (permalink / raw) To: Eli Zaretskii Cc: Clément Pit-Claudel, Stephen Leake, Stefan Monnier, emacs-devel I’m about to change all lisp-facing functions from using byte position to using point. Point is much easier to work with. If lisp wants byte positions, they can just convert from point themselves. Any objections? Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-08-03 10:24 ` Fu Yuan @ 2021-08-03 11:42 ` Eli Zaretskii 2021-08-03 11:53 ` Fu Yuan 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-08-03 11:42 UTC (permalink / raw) To: Fu Yuan; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel > From: Fu Yuan <casouri@gmail.com> > Date: Tue, 3 Aug 2021 06:24:34 -0400 > Cc: Stephen Leake <stephen_leake@stephe-leake.org>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > Stefan Monnier <monnier@iro.umontreal.ca>, emacs-devel@gnu.org > > I’m about to change all lisp-facing functions from using byte position to using point. I don't understand how can you do this. Point is set by Lisp, and generally cannot be changed from C, except for very short durations of time (or if the C code is the implementation of a Lisp command that just moved point). If you need to access some buffer position, you cannot in general use point, because you cannot control where point is. > Point is much easier to work with. In what way is it easier? I feel that I'm missing something here. > If lisp wants byte positions, they can just convert from point themselves. ??? What do you mean by that? Can you show an example? ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-08-03 11:42 ` Eli Zaretskii @ 2021-08-03 11:53 ` Fu Yuan 2021-08-03 12:21 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Fu Yuan @ 2021-08-03 11:53 UTC (permalink / raw) To: Eli Zaretskii; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel > On Aug 3, 2021, at 7:42 AM, Eli Zaretskii <eliz@gnu.org> wrote: > > >> >> From: Fu Yuan <casouri@gmail.com> >> Date: Tue, 3 Aug 2021 06:24:34 -0400 >> Cc: Stephen Leake <stephen_leake@stephe-leake.org>, >> Clément Pit-Claudel <cpitclaudel@gmail.com>, >> Stefan Monnier <monnier@iro.umontreal.ca>, emacs-devel@gnu.org >> >> I’m about to change all lisp-facing functions from using byte position to using point. > > I don't understand how can you do this. Point is set by Lisp, and > generally cannot be changed from C, except for very short durations of > time (or if the C code is the implementation of a Lisp command that > just moved point). If you need to access some buffer position, you > cannot in general use point, because you cannot control where point > is. > >> Point is much easier to work with. > > In what way is it easier? I feel that I'm missing something here. > >> If lisp wants byte positions, they can just convert from point themselves. > > ??? What do you mean by that? Can you show an example? Oh no, I don’t mean that. I meant that, for example, functions like node_start_byte, which returns the byte position of the beginning of the node, will now be node_start_pos, which returns a point position. And if I want the byte position of the beginning of the node, I can use (position-bytes (tree-sitter-node-start-pos node)) Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-08-03 11:53 ` Fu Yuan @ 2021-08-03 12:21 ` Eli Zaretskii 2021-08-03 12:50 ` Fu Yuan 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-08-03 12:21 UTC (permalink / raw) To: Fu Yuan; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel > From: Fu Yuan <casouri@gmail.com> > Date: Tue, 3 Aug 2021 07:53:54 -0400 > Cc: stephen_leake@stephe-leake.org, cpitclaudel@gmail.com, > monnier@iro.umontreal.ca, emacs-devel@gnu.org > > Oh no, I don’t mean that. I meant that, for example, functions like node_start_byte, which returns the byte position of the beginning of the node, will now be node_start_pos, which returns a point position. That's called "character position". Let's use the accepted terminology, to minimize misunderstandings. So in what sense are character positions easier to use than byte positions? > And if I want the byte position of the beginning of the node, I can use (position-bytes (tree-sitter-node-start-pos node)) Caveat: position-bytes can be expensive. So in time-critical code, such as the display engine, we keep both character position and byte position, and update them in sync. Then you can use whichever is easier in each case. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-08-03 12:21 ` Eli Zaretskii @ 2021-08-03 12:50 ` Fu Yuan 2021-08-03 13:03 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Fu Yuan @ 2021-08-03 12:50 UTC (permalink / raw) To: Eli Zaretskii; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel > On Aug 3, 2021, at 8:22 AM, Eli Zaretskii <eliz@gnu.org> wrote: > > >> >> From: Fu Yuan <casouri@gmail.com> >> Date: Tue, 3 Aug 2021 07:53:54 -0400 >> Cc: stephen_leake@stephe-leake.org, cpitclaudel@gmail.com, >> monnier@iro.umontreal.ca, emacs-devel@gnu.org >> >> Oh no, I don’t mean that. I meant that, for example, functions like node_start_byte, which returns the byte position of the beginning of the node, will now be node_start_pos, which returns a point position. > > That's called "character position". Let's use the accepted > terminology, to minimize misunderstandings. Ah, got it. > So in what sense are character positions easier to use than byte > positions? Here is what you can do with positions: - find the smallest node that encloses a range (BEG . END) - get the beginning and end of a node Since all other functions use character positions (e.g., put-text-property, point), using character positions saves Lisp code some ‘position-bytes’ calls. >> And if I want the byte position of the beginning of the node, I can use (position-bytes (tree-sitter-node-start-pos node)) > > Caveat: position-bytes can be expensive. So in time-critical code, > such as the display engine, we keep both character position and byte > position, and update them in sync. Then you can use whichever is > easier in each case. Internally, tree_sitter.c will continue to use byte positions, of course. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-08-03 12:50 ` Fu Yuan @ 2021-08-03 13:03 ` Eli Zaretskii 2021-08-03 13:08 ` Fu Yuan 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-08-03 13:03 UTC (permalink / raw) To: Fu Yuan; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel > From: Fu Yuan <casouri@gmail.com> > Date: Tue, 3 Aug 2021 08:50:45 -0400 > Cc: stephen_leake@stephe-leake.org, cpitclaudel@gmail.com, > monnier@iro.umontreal.ca, emacs-devel@gnu.org > > > So in what sense are character positions easier to use than byte > > positions? > > Here is what you can do with positions: > > - find the smallest node that encloses a range (BEG . END) > - get the beginning and end of a node > > Since all other functions use character positions (e.g., put-text-property, point), using character positions saves Lisp code some ‘position-bytes’ calls. If you are talking about Lisp, then yes, character positions are a much better interface. But on the C level, sometimes you need byte positions, sometimes character positions, and sometimes both. Since you didn't say what level was this about, I cannot say something more intelligent. > Internally, tree_sitter.c will continue to use byte positions, of course. "Internally", as opposed to what? And what is "internal" in this context? I thought we were talking only about the internals. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-08-03 13:03 ` Eli Zaretskii @ 2021-08-03 13:08 ` Fu Yuan 0 siblings, 0 replies; 370+ messages in thread From: Fu Yuan @ 2021-08-03 13:08 UTC (permalink / raw) To: Eli Zaretskii; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel > On Aug 3, 2021, at 9:03 AM, Eli Zaretskii <eliz@gnu.org> wrote: > > >> >> From: Fu Yuan <casouri@gmail.com> >> Date: Tue, 3 Aug 2021 08:50:45 -0400 >> Cc: stephen_leake@stephe-leake.org, cpitclaudel@gmail.com, >> monnier@iro.umontreal.ca, emacs-devel@gnu.org >> >>> So in what sense are character positions easier to use than byte >>> positions? >> >> Here is what you can do with positions: >> >> - find the smallest node that encloses a range (BEG . END) >> - get the beginning and end of a node >> >> Since all other functions use character positions (e.g., put-text-property, point), using character positions saves Lisp code some ‘position-bytes’ calls. > > If you are talking about Lisp, then yes, character positions are a > much better interface. But on the C level, sometimes you need byte > positions, sometimes character positions, and sometimes both. Since > you didn't say what level was this about, I cannot say something more > intelligent. > >> Internally, tree_sitter.c will continue to use byte positions, of course. > > "Internally", as opposed to what? And what is "internal" in this > context? I thought we were talking only about the internals. By internally I mean C level. I will change lisp interface functions to accept and return character positions, and C level code will keep using byte positions. I’ll try to make myself clearer next time :-) Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
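To illustrate why converting a character position to a byte position is not free, here is a minimal sketch of the conversion over a flat UTF-8 array. It is not how Emacs implements the conversion (Emacs has its own internal encoding and caches intermediate results); `char_to_byte_pos` is a hypothetical helper, and positions are 1-based as in Emacs. The point is only that, without a cache, the cost grows with the distance scanned.

```c
#include <assert.h>
#include <stddef.h>

/* Return the 1-based byte position that corresponds to the 1-based
   character position CHARPOS in the UTF-8 text TEXT of NBYTES bytes.
   The scan is linear in the number of bytes before CHARPOS.  */
static ptrdiff_t
char_to_byte_pos (const char *text, ptrdiff_t nbytes, ptrdiff_t charpos)
{
  const unsigned char *t = (const unsigned char *) text;
  ptrdiff_t byte = 0;   /* 0-based index of the current character */
  ptrdiff_t chars = 1;  /* 1-based position of the current character */
  assert (charpos >= 1);
  while (byte < nbytes && chars < charpos)
    {
      /* Skip the lead byte, then any continuation bytes (10xxxxxx).  */
      byte++;
      while (byte < nbytes && (t[byte] & 0xC0) == 0x80)
        byte++;
      chars++;
    }
  return byte + 1;
}
```

For the five-byte text "a", "é" (two bytes), space, "b", character position 3 (the space) maps to byte position 4. This is why code that needs both kinds of position frequently tends to carry them together rather than convert on demand.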
* Re: How to add pseudo vector types 2021-07-30 14:17 ` Yuan Fu 2021-08-03 10:24 ` Fu Yuan @ 2021-08-03 11:47 ` Eli Zaretskii 2021-08-03 12:00 ` Fu Yuan 1 sibling, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-08-03 11:47 UTC (permalink / raw) To: Yuan Fu; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel > From: Yuan Fu <casouri@gmail.com> > Date: Fri, 30 Jul 2021 10:17:22 -0400 > Cc: Stephen Leake <stephen_leake@stephe-leake.org>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > Stefan Monnier <monnier@iro.umontreal.ca>, > emacs-devel@gnu.org > > > That said, it looks like the code is correct: you should record the > > changes in the entire buffer, but only pass to TS the changes inside > > the restriction BEGV..ZV that is in effect at the time of the re-parse > > call. Btw, I don't see the code that filters changes reported to TS > > by their positions against the restriction; did I miss something? > > Yes, I do clip the change to the visible portion: > > ts_record_change (ptrdiff_t start_byte, ptrdiff_t old_end_byte, > ptrdiff_t new_end_byte) > { > eassert(start_byte <= old_end_byte); > eassert(start_byte <= new_end_byte); > > Lisp_Object parser_list = Fsymbol_value (Qtree_sitter_parser_list); > > while (!NILP (parser_list)) > { > Lisp_Object lisp_parser = Fcar (parser_list); > TSTree *tree = XTS_PARSER (lisp_parser)->tree; > if (tree != NULL) > { > /* We "clip" the change to between visible_beg and > visible_end. It is okay if visible_end ends up larger > than BUF_Z, tree-sitter only access buffer text during > re-parse, and we will adjust visible_beg/end before > re-parse. 
*/ > ptrdiff_t visible_beg = XTS_PARSER (lisp_parser)->visible_beg; > ptrdiff_t visible_end = XTS_PARSER (lisp_parser)->visible_end; > > ptrdiff_t visible_start = > max (visible_beg, start_byte) - visible_beg; > ptrdiff_t visible_old_end = > min (visible_end, old_end_byte) - visible_beg; > ptrdiff_t visible_new_end = > min (visible_end, new_end_byte) - visible_beg; > > ts_tree_edit_1 (tree, visible_start, visible_old_end, > visible_new_end); > XTS_PARSER (lisp_parser)->need_reparse = true; > > parser_list = Fcdr (parser_list); Hmm... so a change that begins before the restriction and ends inside the restriction will be sent as if it began at BEGV? And the rest of the change will be discarded? Shouldn't you split such changes in two, send to TS the part inside the restriction, and store the rest for the future, when/if the buffer is widened? ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-08-03 11:47 ` Eli Zaretskii @ 2021-08-03 12:00 ` Fu Yuan 2021-08-03 12:24 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Fu Yuan @ 2021-08-03 12:00 UTC (permalink / raw) To: Eli Zaretskii; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel > On Aug 3, 2021, at 7:48 AM, Eli Zaretskii <eliz@gnu.org> wrote: > > >> >> From: Yuan Fu <casouri@gmail.com> >> Date: Fri, 30 Jul 2021 10:17:22 -0400 >> Cc: Stephen Leake <stephen_leake@stephe-leake.org>, >> Clément Pit-Claudel <cpitclaudel@gmail.com>, >> Stefan Monnier <monnier@iro.umontreal.ca>, >> emacs-devel@gnu.org >> >>> That said, it looks like the code is correct: you should record the >>> changes in the entire buffer, but only pass to TS the changes inside >>> the restriction BEGV..ZV that is in effect at the time of the re-parse >>> call. Btw, I don't see the code that filters changes reported to TS >>> by their positions against the restriction; did I miss something? >> >> Yes, I do clip the change to the visible portion: >> >> ts_record_change (ptrdiff_t start_byte, ptrdiff_t old_end_byte, >> ptrdiff_t new_end_byte) >> { >> eassert(start_byte <= old_end_byte); >> eassert(start_byte <= new_end_byte); >> >> Lisp_Object parser_list = Fsymbol_value (Qtree_sitter_parser_list); >> >> while (!NILP (parser_list)) >> { >> Lisp_Object lisp_parser = Fcar (parser_list); >> TSTree *tree = XTS_PARSER (lisp_parser)->tree; >> if (tree != NULL) >> { >> /* We "clip" the change to between visible_beg and >> visible_end. It is okay if visible_end ends up larger >> than BUF_Z, tree-sitter only access buffer text during >> re-parse, and we will adjust visible_beg/end before >> re-parse. 
*/ >> ptrdiff_t visible_beg = XTS_PARSER (lisp_parser)->visible_beg; >> ptrdiff_t visible_end = XTS_PARSER (lisp_parser)->visible_end; >> >> ptrdiff_t visible_start = >> max (visible_beg, start_byte) - visible_beg; >> ptrdiff_t visible_old_end = >> min (visible_end, old_end_byte) - visible_beg; >> ptrdiff_t visible_new_end = >> min (visible_end, new_end_byte) - visible_beg; >> >> ts_tree_edit_1 (tree, visible_start, visible_old_end, >> visible_new_end); >> XTS_PARSER (lisp_parser)->need_reparse = true; >> >> parser_list = Fcdr (parser_list); > > Hmm... so a change that begins before the restriction and ends inside > the restriction will be sent as if it began at BEGV? And the rest of > the change will be discarded? Shouldn't you split such changes in > two, send to TS the part inside the restriction, and store the rest > for the future, when/if the buffer is widened? Tree-sitter doesn’t care about the content in a change, it will re-scan the buffer content when it re-parses. We only need to inform it of the range of the change, so it knows where to re-scan when it re-parses. When the buffer is widened, we will tell tree-sitter that range [BUF_BEG, BUF_BEGV] has changed, and it will re-scan that part when re-parsing. So the part outside the narrowed region will be parsed correctly. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
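The clipping arithmetic quoted above can be tried in isolation. The sketch below mirrors the max/min expressions from ts_record_change; `struct clipped_edit` and `clip_change` are hypothetical names, and the code assumes the change overlaps the visible range (a change lying entirely outside it is not handled here).

```c
#include <assert.h>
#include <stddef.h>

/* The three offsets handed to the tree edit, relative to the
   beginning of the visible range.  */
struct clipped_edit
{
  ptrdiff_t start;
  ptrdiff_t old_end;
  ptrdiff_t new_end;
};

static ptrdiff_t
pmax (ptrdiff_t a, ptrdiff_t b)
{
  return a > b ? a : b;
}

static ptrdiff_t
pmin (ptrdiff_t a, ptrdiff_t b)
{
  return a < b ? a : b;
}

/* Clip a change given in buffer byte positions to the range
   [VISIBLE_BEG, VISIBLE_END] and translate it to offsets relative
   to VISIBLE_BEG.  Assumes the change overlaps the range.  */
static struct clipped_edit
clip_change (ptrdiff_t visible_beg, ptrdiff_t visible_end,
             ptrdiff_t start_byte, ptrdiff_t old_end_byte,
             ptrdiff_t new_end_byte)
{
  assert (start_byte <= old_end_byte);
  assert (start_byte <= new_end_byte);
  struct clipped_edit e;
  e.start = pmax (visible_beg, start_byte) - visible_beg;
  e.old_end = pmin (visible_end, old_end_byte) - visible_beg;
  e.new_end = pmin (visible_end, new_end_byte) - visible_beg;
  return e;
}
```

With a visible range of 10..50, a change that starts at 5 and replaces the bytes up to 20 with text ending at 25 is reported as (0, 10, 15): only the part from the visible beginning onward reaches the parser, which matches the straddling case asked about.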
* Re: How to add pseudo vector types 2021-08-03 12:00 ` Fu Yuan @ 2021-08-03 12:24 ` Eli Zaretskii 2021-08-03 13:00 ` Fu Yuan 2021-08-03 13:28 ` Stefan Monnier 0 siblings, 2 replies; 370+ messages in thread From: Eli Zaretskii @ 2021-08-03 12:24 UTC (permalink / raw) To: Fu Yuan; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel > From: Fu Yuan <casouri@gmail.com> > Date: Tue, 3 Aug 2021 08:00:46 -0400 > Cc: stephen_leake@stephe-leake.org, cpitclaudel@gmail.com, > monnier@iro.umontreal.ca, emacs-devel@gnu.org > > > Hmm... so a change that begins before the restriction and ends inside > > the restriction will be sent as if it began at BEGV? And the rest of > > the change will be discarded? Shouldn't you split such changes in > > two, send to TS the part inside the restriction, and store the rest > > for the future, when/if the buffer is widened? > > Tree-sitter doesn’t care about the content in a change, it will re-scan the buffer content when it re-parses. We only need to inform it of the range of the change, so it knows where to re-scan when it re-parses. When the buffer is widened, we will tell tree-sitter that range [BUF_BEG, BUF_BEGV] has changed, and it will re-scan that part when re-parsing. But that's sub-optimal, no? Imagine a very large buffer which was narrowed to a small portion near EOB, then a modification made very close to EOB but partially before BEGV, then the buffer widened. With your method, TS will now have to re-parse almost the entire buffer, whereas we know it needs to re-parse a very small portion of it. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-08-03 12:24 ` Eli Zaretskii @ 2021-08-03 13:00 ` Fu Yuan 2021-08-03 13:28 ` Stefan Monnier 1 sibling, 0 replies; 370+ messages in thread From: Fu Yuan @ 2021-08-03 13:00 UTC (permalink / raw) To: Eli Zaretskii; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel > On Aug 3, 2021, at 8:25 AM, Eli Zaretskii <eliz@gnu.org> wrote: > > >> >> From: Fu Yuan <casouri@gmail.com> >> Date: Tue, 3 Aug 2021 08:00:46 -0400 >> Cc: stephen_leake@stephe-leake.org, cpitclaudel@gmail.com, >> monnier@iro.umontreal.ca, emacs-devel@gnu.org >> >>> Hmm... so a change that begins before the restriction and ends inside >>> the restriction will be sent as if it began at BEGV? And the rest of >>> the change will be discarded? Shouldn't you split such changes in >>> tow, send to TS the part inside the restriction, and store the rest >>> for the future, when/if the buffer is widened? >> >> Tree-sitter doesn’t care about the content in a change, it will re-scan the buffer content when it re-parses. We only need to inform it the range of the change, so it knows where to re-scan when it re-parses. When the buffer is widened, we will tell tree-sitter that range [BUF_BEG, BUF_BEGV] has changed, and it will re-scan that part when re-parsing. > > But that's sub-optimal, no? Imagine a very large buffer which was > narrowed to a small portion near EOB, then a modification made very > close to EOB but partially before BEGV, then the buffer widened. With > your method, TS will now have to re-parse almost the entire buffer, > whereas we know it needs to re-parse a very small portion of it. It is indeed, but that’s unavoidable given the way we hide the narrowed-out part of the buffer from tree-sitter. We pretend BUF_BEGV is the beginning of the buffer and nothing exists before it. Then when we widen, we need to “insert” the content between BUF_BEG and BUF_BEGV. I.e., as far as tree-sitter can tell, we inserted that text. 
If you want to hide something and then re-show it to tree-sitter, and want tree-sitter to know how to re-parse minimally, you should use tree-sitter-parser-set-included-ranges (ts_parser_set_included_ranges). I’ve written the Lisp binding for it but haven’t pushed the change. The reason why I didn’t implement narrowing with set-ranges was explained earlier. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
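To make the cost of that design concrete: when the parser only ever sees the visible region, widening has to be reported to it as an insertion. The following is a rough sketch under stated assumptions, not code from the branch. `struct edit` is a hypothetical type that mirrors the three byte offsets of tree-sitter's `TSInputEdit` (`start_byte`, `old_end_byte`, `new_end_byte`), and `edit_for_widen_front` and `edit_for_widen_back` are names invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the byte-offset fields of TSInputEdit.
   All offsets are relative to the start of what the parser sees.  */
struct edit
{
  ptrdiff_t start;
  ptrdiff_t old_end;
  ptrdiff_t new_end;
};

/* The start of the visible region moves back from OLD_BEG to NEW_BEG.
   To the parser, (OLD_BEG - NEW_BEG) bytes appear at offset 0.  */
static struct edit
edit_for_widen_front (ptrdiff_t old_beg, ptrdiff_t new_beg)
{
  assert (new_beg <= old_beg);
  struct edit e = { 0, 0, old_beg - new_beg };
  return e;
}

/* The end of the visible region moves forward from OLD_END to
   NEW_END while the region still starts at BEG.  To the parser,
   bytes are appended at the old length of its text.  */
static struct edit
edit_for_widen_back (ptrdiff_t beg, ptrdiff_t old_end, ptrdiff_t new_end)
{
  assert (beg <= old_end && old_end <= new_end);
  ptrdiff_t old_len = old_end - beg;
  struct edit e = { old_len, old_len, old_len + (new_end - old_end) };
  return e;
}
```

In Eli's scenario (a large buffer narrowed to a small region near EOB), the front edit spans nearly the whole buffer, which is why the re-parse after widening is expensive.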
* Re: How to add pseudo vector types 2021-08-03 12:24 ` Eli Zaretskii 2021-08-03 13:00 ` Fu Yuan @ 2021-08-03 13:28 ` Stefan Monnier 2021-08-03 13:34 ` Eli Zaretskii 1 sibling, 1 reply; 370+ messages in thread From: Stefan Monnier @ 2021-08-03 13:28 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Fu Yuan, stephen_leake, cpitclaudel, emacs-devel > But that's sub-optimal, no? Imagine a very large buffer which was > narrowed to a small portion near EOB, then a modification made very > close to EOB but partially before BEGV, then the buffer widened. With > your method, TS will now have to re-parse almost the entire buffer, > whereas we know it needs to re-parse a very small portion of it. As a general rule, we will most likely want to work hard to avoid exposing the narrowed buffer to TS (i.e. most calls to TS will first `widen`). Or we will want to keep several parse trees (one per narrowing). We have the same problem already with `syntax-ppss` which we solve by keeping two sets of data (`syntax-ppss-wide` and `syntax-ppss-narrow`). Stefan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-08-03 13:28 ` Stefan Monnier @ 2021-08-03 13:34 ` Eli Zaretskii 2021-08-06 3:22 ` Yuan Fu 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-08-03 13:34 UTC (permalink / raw) To: Stefan Monnier; +Cc: casouri, stephen_leake, cpitclaudel, emacs-devel > From: Stefan Monnier <monnier@iro.umontreal.ca> > Cc: Fu Yuan <casouri@gmail.com>, stephen_leake@stephe-leake.org, > cpitclaudel@gmail.com, emacs-devel@gnu.org > Date: Tue, 03 Aug 2021 09:28:57 -0400 > > > But that's sub-optimal, no? Imagine a very large buffer which was > > narrowed to a small portion near EOB, then a modification made very > > close to EOB but partially before BEGV, then the buffer widened. With > > your method, TS will now have to re-parse almost the entire buffer, > > whereas we know it needs to re-parse a very small portion of it. > > As a general rule, we will most likely want to work hard to avoid > exposing the narrowed buffer to TS (i.e. most calls to TS will first > `widen`). Sure. I was thinking about those corner cases where we won't. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-08-03 13:34 ` Eli Zaretskii @ 2021-08-06 3:22 ` Yuan Fu 2021-08-06 6:37 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-08-06 3:22 UTC (permalink / raw) To: Eli Zaretskii; +Cc: cpitclaudel, stephen_leake, Stefan Monnier, emacs-devel I’ve added bindings for set_ranges and pushed the latest code to https://github.com/casouri/emacs/tree/ts For now, I’ve created bindings for most of the functions I want to expose. Next I’ll probably write some more tests (in addition to responding to reviews and comments). Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: How to add pseudo vector types 2021-08-06 3:22 ` Yuan Fu @ 2021-08-06 6:37 ` Eli Zaretskii 2021-08-07 5:31 ` Tree-sitter api (Was: Re: How to add pseudo vector types) Fu Yuan 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-08-06 6:37 UTC (permalink / raw) To: Yuan Fu; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel > From: Yuan Fu <casouri@gmail.com> > Date: Thu, 5 Aug 2021 23:22:17 -0400 > Cc: cpitclaudel@gmail.com, stephen_leake@stephe-leake.org, > Stefan Monnier <monnier@iro.umontreal.ca>, emacs-devel@gnu.org > > I’ve added bindings for set_ranges and pushed the latest code to https://github.com/casouri/emacs/tree/ts > > As for now, I’ve created bindings for most of the functions I want to expose. Next I’ll probably write some more tests (in addition to responding reviews and comments). Thanks. We should probably start thinking how to integrate TS-related functionalities into Emacs in general. E.g., should there be an option to activate it? should this option be per major mode? something else? ^ permalink raw reply [flat|nested] 370+ messages in thread
* Tree-sitter api (Was: Re: How to add pseudo vector types) 2021-08-06 6:37 ` Eli Zaretskii @ 2021-08-07 5:31 ` Fu Yuan 2021-08-07 6:26 ` Eli Zaretskii 2021-08-07 15:47 ` Tree-sitter api Stefan Monnier 0 siblings, 2 replies; 370+ messages in thread From: Fu Yuan @ 2021-08-07 5:31 UTC (permalink / raw) To: Eli Zaretskii; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel > On Aug 6, 2021, at 1:37 AM, Eli Zaretskii <eliz@gnu.org> wrote: > > >> >> From: Yuan Fu <casouri@gmail.com> >> Date: Thu, 5 Aug 2021 23:22:17 -0400 >> Cc: cpitclaudel@gmail.com, stephen_leake@stephe-leake.org, >> Stefan Monnier <monnier@iro.umontreal.ca>, emacs-devel@gnu.org >> >> I’ve added bindings for set_ranges and pushed the latest code to https://github.com/casouri/emacs/tree/ts >> >> As for now, I’ve created bindings for most of the functions I want to expose. Next I’ll probably write some more tests (in addition to responding reviews and comments). > > Thanks. > > We should probably start thinking how to integrate TS-related > functionalities into Emacs in general. E.g., should there be an > option to activate it? should this option be per major mode? something > else? We should have a user option to control tree-sitter at the major-mode level. Maybe an alist where each car is a major mode symbol and each cdr is a boolean value toggling tree-sitter for that mode. We also need tree-sitter-maximum-buffer-size, so that buffers larger than this size won’t enable tree-sitter. (And we need to make sure we never use tree-sitter on buffers larger than 4GB, because tree-sitter stores byte offsets as uint32_t.) And we can provide a function tree-sitter-should-activate-p that computes whether we should enable tree-sitter in the current buffer from the variables mentioned above; major modes can use it when setting up. I’m also thinking about having a tree-sitter-defaults that’s analogous to font-lock-defaults, set by each major mode and used to generate tree-sitter-font-lock-settings. 
As for indentation, we could provide some infrastructure like we do for font-locking, or we can just let major modes implement their indent function with tree-sitter api. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
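The activation check described in this message could look roughly like the following. This is a sketch only: `ts_should_activate`, its parameters, and the idea of passing the per-mode toggle in as a plain boolean are assumptions made for illustration. The one hard fact it encodes is that tree-sitter stores byte offsets as `uint32_t`, so a larger buffer cannot be addressed at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical predicate: should tree-sitter be enabled for a buffer
   of BUFFER_BYTES bytes?  ENABLED_FOR_MODE stands for the per-mode
   user option, MAX_BYTES for tree-sitter-maximum-buffer-size.  */
static bool
ts_should_activate (bool enabled_for_mode, ptrdiff_t buffer_bytes,
                    ptrdiff_t max_bytes)
{
  assert (buffer_bytes >= 0);
  if (!enabled_for_mode)
    return false;
  /* Soft limit chosen by the user.  */
  if (buffer_bytes > max_bytes)
    return false;
  /* Hard limit: offsets must fit in tree-sitter's uint32_t.  */
  if ((uintmax_t) buffer_bytes > UINT32_MAX)
    return false;
  return true;
}
```

Keeping the hard limit separate from the user option means a user who raises the maximum size still cannot hand tree-sitter a buffer it is unable to index.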
* Re: Tree-sitter api (Was: Re: How to add pseudo vector types) 2021-08-07 5:31 ` Tree-sitter api (Was: Re: How to add pseudo vector types) Fu Yuan @ 2021-08-07 6:26 ` Eli Zaretskii 2021-08-07 15:47 ` Tree-sitter api Stefan Monnier 1 sibling, 0 replies; 370+ messages in thread From: Eli Zaretskii @ 2021-08-07 6:26 UTC (permalink / raw) To: Fu Yuan; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel > From: Fu Yuan <casouri@gmail.com> > Date: Sat, 7 Aug 2021 00:31:36 -0500 > Cc: cpitclaudel@gmail.com, stephen_leake@stephe-leake.org, > monnier@iro.umontreal.ca, emacs-devel@gnu.org > > > We should probably start thinking how to integrate TS-related > > functionalities into Emacs in general. E.g., should there be an > > option to activate it? should this option be per major mode? something > > else? > > We should have a user option to control tree-sitter on major mode level. Maybe an alist where each car is a major node symbol and each cdr is a Boolean value toggling tree-sitter for that node. > > We also need tree-sitter-maximum-buffer-size, so that buffer larger than this size won’t enable tree-sitter. (And we need to make sure we never use tree-sitter on buffers larger than 4GB because tree-sitter uses unint32.) > > And we can provide a function free-sitter-should-activate-p that computes if we should enable tree-sitter in the current buffer by variables mentioned above, that can be used by major-modes when setting up. > > I’m also thinking about having a tree-sitter-defaults that’s analogous to font-lock-defaults, that is set by each major node and used to generate tree-sitter-font-lock-settings. > > As for indentation, we could provide some infrastructure like we do for font-locking, or we can just let major modes implement their indent function with tree-sitter api. SGTM, thanks. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-08-07 5:31 ` Tree-sitter api (Was: Re: How to add pseudo vector types) Fu Yuan 2021-08-07 6:26 ` Eli Zaretskii @ 2021-08-07 15:47 ` Stefan Monnier 2021-08-07 18:40 ` Theodor Thornhill 2021-08-08 22:56 ` Yuan Fu 1 sibling, 2 replies; 370+ messages in thread From: Stefan Monnier @ 2021-08-07 15:47 UTC (permalink / raw) To: Fu Yuan; +Cc: Eli Zaretskii, cpitclaudel, stephen_leake, emacs-devel > We should have a user option to control tree-sitter on major mode > level. Maybe an alist where each car is a major node symbol and each cdr is > a Boolean value toggling tree-sitter for that node. The more traditional approach is to use a buffer-local var set by the major mode or set via (add-hook '<MODE>-hook ...). > As for indentation, we could provide some infrastructure like we do for > font-locking, or we can just let major modes implement their indent function > with tree-sitter api. We should definitely provide the infrastructure (even if it's fairly simple) so that major modes only have to provide some rules. Stefan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-08-07 15:47 ` Tree-sitter api Stefan Monnier @ 2021-08-07 18:40 ` Theodor Thornhill 2021-08-07 19:53 ` Stefan Monnier 2021-08-08 22:56 ` Yuan Fu 1 sibling, 1 reply; 370+ messages in thread From: Theodor Thornhill @ 2021-08-07 18:40 UTC (permalink / raw) To: Stefan Monnier, Fu Yuan Cc: Eli Zaretskii, cpitclaudel, stephen_leake, emacs-devel Stefan Monnier <monnier@iro.umontreal.ca> writes: >> We should have a user option to control tree-sitter on major mode >> level. Maybe an alist where each car is a major node symbol and each cdr is >> a Boolean value toggling tree-sitter for that node. > > The more traditional approach is to use a buffer-local var set by the > major mode or set via (add-hook '<MODE>-hook ...). > >> As for indentation, we could provide some infrastructure like we do for >> font-locking, or we can just let major modes implement their indent function >> with tree-sitter api. > > We should definitely provide the infrastructure (even if it's fairly > simple) so that major modes only have to provide some rules. > Yeah, though that quickly becomes not so simple, considering that different languages have their own idiosyncrasies with indentation. C#, for instance, is a rats nest of particularities. And this is not considering variations of style guides etc. It would be nice to get an api similar to what CC mode has. Font locking is an easier problem, since it's just "fontify from node-start to node-end". I'm not sure how to best provide this api, but I've worked a lot with CC mode and the new tree-sitter-indent [1]. It quickly gets confusing and reminds me of `display-buffer`. Providing both a `tree-sitter-indent-engine` mode as well as the low level api for major mode authors would be nice as well. Providing something too simple would just make people not use it, since the weirder cases won't be covered. -- Theo [1]: https://codeberg.org/FelipeLema/tree-sitter-indent.el ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-08-07 18:40 ` Theodor Thornhill @ 2021-08-07 19:53 ` Stefan Monnier 2021-08-17 6:18 ` Yuan Fu 0 siblings, 1 reply; 370+ messages in thread From: Stefan Monnier @ 2021-08-07 19:53 UTC (permalink / raw) To: Theodor Thornhill Cc: Fu Yuan, Eli Zaretskii, cpitclaudel, stephen_leake, emacs-devel > Yeah, though that quickly becomes not so simple, considering that > different languages have their own idiosyncrasies with indentation. C#, > for instance, is a rats nest of particularities. And this is not > considering variations of style guides etc. It would be nice to get an > api similar to what CC mode has. I'm thinking of rules specified via a function that takes a TS node (from which the function can explore the rest of the TS tree) and return the indentation to use, represented as a pair (POSITION . OFFSET) (meaning to indent OFFSET columns further than the column position of POSITION). The infrastructure would limit itself to making sure we have an uptodate tree (computed from a properly widened buffer), find the node corresponding to point pass it to the function and then turn the return value into an actual column and indent the text accordingly (paying attention to the usual difference between when point is "within the indentation" vs "within the text"). Stefan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-08-07 19:53 ` Stefan Monnier @ 2021-08-17 6:18 ` Yuan Fu 2021-08-18 18:27 ` Stephen Leake ` (2 more replies) 0 siblings, 3 replies; 370+ messages in thread From: Yuan Fu @ 2021-08-17 6:18 UTC (permalink / raw) To: Stefan Monnier Cc: Stephen Leake, Eli Zaretskii, Theodor Thornhill, Clément Pit-Claudel, emacs-devel > > I'm thinking of rules specified via a function that takes a TS node > (from which the function can explore the rest of the TS tree) and return > the indentation to use, represented as a pair (POSITION . OFFSET) > (meaning to indent OFFSET columns further than the column position of > POSITION). > > The infrastructure would limit itself to making sure we have an uptodate > tree (computed from a properly widened buffer), find the node > corresponding to point pass it to the function and then turn the return > value into an actual column and indent the text accordingly (paying > attention to the usual difference between when point is "within the > indentation" vs "within the text”). Okay, here is the (ad-hoc) infrastructure I came up with: We have a tree-sitter-simple-indent-function. Major-mode authors can set indent-line-function to it to use the simple-indent system. tree-sitter-simple-indent-function indents according to tree-sitter-simple-indent-rules. Doc string of tree-sitter-simple-indent-rules reads: A list of indent rule settings. Each indent rule setting should be (LANGUAGE . RULES), where LANGUAGE is a language symbol, and RULES is a list of (MATCHER ANCHOR OFFSET). MATCHER determines whether this rule applies, ANCHOR and OFFSET together determines which column to indent to. A MATCHER is a function that takes three arguments (NODE PARENT BOL). NODE is the largest (highest-in-tree) node starting at point. PARENT is the parent of NODE. BOL is the point where we are indenting: the beginning of line content, the position of the first non-whitespace character. 
If MATCHER returns non-nil, meaning the rule matches, Emacs then uses ANCHOR to find an anchor; ANCHOR should be a function that takes the same arguments (NODE PARENT BOL) and returns a point. Finally Emacs computes the column of the point returned by ANCHOR, adds OFFSET to it, and indents the line to that column. For MATCHER and ANCHOR, Emacs provides some convenient presets. See `tree-sitter-simple-indent-presets'. And doc string for tree-sitter-simple-indent-presets: A list of presets. These presets can be used as MATCHER and ANCHOR in `tree-sitter-simple-indent-rules'. MATCHER: (match NODE-TYPE PARENT-TYPE NODE-FIELD NODE-INDEX-MIN NODE-INDEX-MAX) NODE-TYPE checks for the node's type, PARENT-TYPE checks for the parent's type, NODE-FIELD checks for the field name of the node in the parent, NODE-INDEX-MIN and NODE-INDEX-MAX check for the node's index in the parent. Therefore, to match the first child where parent is \"argument_list\", use (match nil \"argument_list\" nil 0 0). no-node Matches the case where node is nil, i.e., there is no node that starts at point. This is the case when indenting an empty line. (node-at-point TYPE NAMED) Checks that the node at point -- not the largest node starting at point -- has type TYPE. If NAMED is non-nil, check the named node at point. (parent-is TYPE) Checks that the parent has type TYPE. (node-is TYPE) Checks that the node has type TYPE. (parent-match PATTERN) Checks that the parent matches PATTERN, a query pattern. (node-match PATTERN) Checks that the node matches PATTERN, a query pattern. ANCHOR: first-child Find the first child of the parent. parent Find the parent. prev-sibling Find node's previous sibling. no-indent Do nothing. prev-line Find the named node on the previous line. This can be used when indenting an empty line: just indent like the previous node. An example of using these facilities can be found in ts-c-tree-sitter-indent-rules. 
For example, ((match nil "function_definition" "body") parent 0) means “match the node whose parent’s type is "function_definition" and whose field name is "body", and indent it to the start of its parent”. That indents the opening brace in int main () { } ((parent-is "call_expression") parent 2) means “match the node whose parent’s type is "call_expression", and indent it to the start of its parent plus 2”. That indents the second line in my_cool_function (arg1, arg2, arg3) I’ve implemented some indentation rules for C in ts-c-mode as usual. I expect someone more knowledgeable in C to actually implement it later. So… do you think this is ok, or convoluted? In particular, is there a better way to implement those “presets”? I don’t want to define them as normal functions, because then their names will be super long (parent-is -> tree-sitter-simple-indent-parent-is) and annoying to use when writing rules, but putting them in an alist (tree-sitter-simple-indent-presets) is a bit ad-hoc. I call these presets with tree-sitter--simple-apply, which basically looks up tree-sitter-simple-indent-presets, gets the function and applies it. You can find the latest version at https://github.com/casouri/emacs/tree/ts I.e., git clone https://github.com/casouri/emacs.git --branch ts Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
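The evaluation order described in the two docstrings (try each rule's MATCHER in turn, and for the first match take the ANCHOR's column plus OFFSET) can be modelled outside Emacs. The following is an illustrative sketch in C, not the Lisp implementation on the branch. `struct node`, `struct rule`, `parent_is`, `anchor_parent` and `indent_column` are all hypothetical, nodes carry a precomputed start column instead of a buffer position, and a matcher takes a single string argument.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, much simplified syntax node.  */
struct node
{
  const char *type;            /* e.g. "call_expression" */
  const struct node *parent;   /* NULL for the root */
  int column;                  /* column where the node starts */
};

typedef int (*matcher_fn) (const struct node *node, const char *arg);
typedef const struct node *(*anchor_fn) (const struct node *node);

/* One (MATCHER ANCHOR OFFSET) rule.  */
struct rule
{
  matcher_fn matcher;
  const char *arg;     /* argument handed to the matcher */
  anchor_fn anchor;
  int offset;
};

/* Model of the `parent-is' preset.  */
static int
parent_is (const struct node *node, const char *type)
{
  return node->parent != NULL && strcmp (node->parent->type, type) == 0;
}

/* Model of the `parent' anchor preset.  */
static const struct node *
anchor_parent (const struct node *node)
{
  return node->parent;
}

/* Return the column NODE should be indented to, or -1 if no rule
   matches.  The first matching rule wins.  */
static int
indent_column (const struct node *node, const struct rule *rules,
               size_t n_rules)
{
  assert (node != NULL);
  for (size_t i = 0; i < n_rules; i++)
    if (rules[i].matcher (node, rules[i].arg))
      {
        const struct node *anchor = rules[i].anchor (node);
        return anchor != NULL ? anchor->column + rules[i].offset : -1;
      }
  return -1;
}
```

With a rule table holding the model equivalent of ((parent-is "call_expression") parent 2), an argument list whose call expression starts in column 4 is indented to column 6. Presets stay short here because they are looked up through the table rather than by global name, which is the same trade-off the alist makes in Lisp.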
* Re: Tree-sitter api 2021-08-17 6:18 ` Yuan Fu @ 2021-08-18 18:27 ` Stephen Leake 2021-08-18 21:30 ` Yuan Fu 2021-08-23 6:51 ` Yuan Fu 2021-08-22 2:43 ` Yuan Fu 2021-08-25 0:21 ` Stefan Monnier 2 siblings, 2 replies; 370+ messages in thread From: Stephen Leake @ 2021-08-18 18:27 UTC (permalink / raw) To: Yuan Fu Cc: Eli Zaretskii, Theodor Thornhill, Stefan Monnier, Clément Pit-Claudel, emacs-devel This looks very interesting, but I have a migraine right now, so I'll have to look at it later. You could try writing indent rules for Ada; current ada-mode code is in https://savannah.nongnu.org/git/?group=ada-mode. See the test/ directory for examples of known good indentation. ada-mode takes the approach of embedding the indent rules directly in the grammar, and the functions that do that provide a few more options than yours. To see the definition of those functions, you'll have to install the wisi package, and look in wisi.info, section Grammar actions. (it would be nice if that info/html file was linked from the GNU ELPA package page; I'll start a new thread for that). Yuan Fu <casouri@gmail.com> writes: >> >> I'm thinking of rules specified via a function that takes a TS node >> (from which the function can explore the rest of the TS tree) and return >> the indentation to use, represented as a pair (POSITION . OFFSET) >> (meaning to indent OFFSET columns further than the column position of >> POSITION). >> >> The infrastructure would limit itself to making sure we have an uptodate >> tree (computed from a properly widened buffer), find the node >> corresponding to point pass it to the function and then turn the return >> value into an actual column and indent the text accordingly (paying >> attention to the usual difference between when point is "within the >> indentation" vs "within the text”). > > Okay, here is the (ad-hoc) infrastructure I came up with: > > We have a tree-sitter-simple-indent-function. 
Major-mode authors can set indent-line-function to it to use the simple-indent system. tree-sitter-simple-indent-function indents according to tree-sitter-simple-indent-rules. Doc string of tree-sitter-simple-indent-rules reads: > > A list of indent rule settings. > Each indent rule setting should be (LANGUAGE . RULES), > where LANGUAGE is a language symbol, and RULES is a list of > (MATCHER ANCHOR OFFSET). > > MATCHER determines whether this rule applies, ANCHOR and OFFSET > together determines which column to indent to. > > A MATCHER is a function that takes three arguments (NODE PARENT > BOL). NODE is the largest (highest-in-tree) node starting at > point. PARENT is the parent of NODE. BOL is the point where we > are indenting: the beginning of line content, the position of the > first non-whitespace character. > > If MATCHER returns non-nil, meaning the rule matches, Emacs then > uses ANCHOR to find an anchor, it should be a function that takes > the same argument (NODE PARENT BOL) and returns a point. > > Finally Emacs computes the column of that point returned by ANCHOR > and adds OFFSET to it, and indent the line to that column. > > For MATCHER and ANCHOR, Emacs provides some convenient presets. > See `tree-sitter-simple-indent-presets’. > > And doc string for tree-sitter-simple-indent-presets: > > A list of presets. > These presets can be used as MATHER and ANCHOR in > `tree-sitter-simple-indent-rules'. > > MATCHER: > > (match NODE-TYPE PARENT-TYPE NODE-FIELD NODE-INDEX-MIN NODE-INDEX-MAX) > > NODE-TYPE checks for node's type, PARENT-TYPE check for > parent's type, NODE-FIELD checks for the filed name of node > in the parent, NODE-INDEX-MIN and NODE-INDEX-MAX checks for > the node's index in the parent. Therefore, to match the > first child where parent is \"argument_list\", use (match nil > \"argument_list\" nil nil 0 0). > > no-node > > Matches the case where node is nil, i.e., there is no node > that starts at point. 
This is the case when indenting an > empty line. > > (node-at-point TYPE NAMED) > > Check that the node at point -- not the largest node starting at > point -- has type TYPE. If NAMED non-nil, check the named node > at point. > > (parent-is TYPE) > > Check that the parent has type TYPE. > > (node-is TYPE) > > Checks that the node has type TYPE. > > (parent-match PATTERN) > > Checks that the parent matches PATTERN, a query pattern. > > (node-match PATTERN) > > Checks that the node matches PATTERN, a query pattern. > > ANCHOR: > > first-child > > Find the first child of the parent. > > parent > > Find the parent. > > prev-sibling > > Find node's previous sibling. > > no-indent > > Do nothing. > > prev-line > > Find the named node on previous line. This can be used when > indenting an empty line: just indent like the previous node. > > An example of using these facility can be found in ts-c-tree-sitter-indent-rules. > > For example, > > ((match nil "function_definition" "body") parent 0) > > means “match the node which it’s parent’s type is “function_definition” and its field name is “body”, indent to the start of its parent. That indents the starting braces in > > int main () > { > } > > ((parent-is "call_expression") parent 2) > > Means “match the node which its’ parent’s type is “call_expression”, and indent to the start of its parent + 2. That indents the second line in > > my_cool_function > (arg1, arg2, arg3) > > I’ve implemented some indentation rules for C in ts-c-mode as usual. I expect someone more knowledgeable in C to actually implement it later. > > So… do you think this is ok, or convoluted? In particular, is there a better way to implement those “presets”? I don’t want to define them as normal functions, because then their name will be super long (parent-is -> tree-sitter-simple-indent-parent-is) and annoying to use when writing rules, but putting them in an alist (tree-sitter-simple-indent-presets) is a bit ad-hoc. 
I call these presets with tree-sitter--simple-apply, which basically looks up tree-sitter-simple-indent-presets, get the function and apply it. > > You can find the latest version at https://github.com/casouri/emacs/tree/ts > I.e., git clone https://github.com/casouri/emacs.git --branch ts > > Yuan > > -- -- Stephe ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-08-18 18:27 ` Stephen Leake @ 2021-08-18 21:30 ` Yuan Fu 2021-08-20 0:12 ` [SPAM UNSURE] " Stephen Leake 2021-08-23 6:51 ` Yuan Fu 1 sibling, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-08-18 21:30 UTC (permalink / raw) To: Stephen Leake Cc: Eli Zaretskii, Theodor Thornhill, Stefan Monnier, Clément Pit-Claudel, emacs-devel > > This looks very interesting, but I have a migraine right now, so I'll > have to look at it later. Hope you get better soon :-) > You could try writing indent rules for Ada; current ada-mode code is in > https://savannah.nongnu.org/git/?group=ada-mode. See the test/ directory > for examples of known good indentation. > > ada-mode takes the approach of embedding the indent rules directly in > the grammar, and the functions that do that provide a few more options > than yours. To see the definition of those functions, you'll have to > install the wisi package, and look in wisi.info, section Grammar > actions. (it would be nice if that info/html file was linked from the > GNU ELPA package page; I'll start a new thread for that). Thanks. I’ll see what I can do; I know nearly nothing about Ada except that it is commissioned by the department of defense :-) BTW, while I was reading the manual, I noticed a typo: If token labels are used in a right hand side, they must be given explicitly in the indent arguments, using he lisp "cons" ^ syntax. Labels are normally only used with EBNF grammars, which expand into multiple right hand sides, with optional tokens simply left out. Explicit labels on the indent arguments allow them to be left out as well. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: [SPAM UNSURE] Re: Tree-sitter api 2021-08-18 21:30 ` Yuan Fu @ 2021-08-20 0:12 ` Stephen Leake 0 siblings, 0 replies; 370+ messages in thread From: Stephen Leake @ 2021-08-20 0:12 UTC (permalink / raw) To: Yuan Fu Cc: Eli Zaretskii, Theodor Thornhill, Stefan Monnier, Clément Pit-Claudel, emacs-devel Yuan Fu <casouri@gmail.com> writes: >> You could try writing indent rules for Ada; current ada-mode code is in >> https://savannah.nongnu.org/git/?group=ada-mode. See the test/ directory >> for examples of known good indentation. >> >> ada-mode takes the approach of embedding the indent rules directly in >> the grammar, and the functions that do that provide a few more options >> than yours. To see the definition of those functions, you'll have to >> install the wisi package, and look in wisi.info, section Grammar >> actions. (it would be nice if that info/html file was linked from the >> GNU ELPA package page; I'll start a new thread for that). > > Thanks. I’ll see what I can do; I know nearly nothing about Ada except > that it is commissioned by the department of defense :-) Was, a long time ago. Now it is used by high-security, high-reliability applications (train control, spacecraft (European, not NASA, sigh), banks). AdaCore is a company thriving on the business model of selling support for the Gnu Ada compiler and associated tools. > BTW, while I was reading the manual, I noticed a typo: Thanks. -- -- Stephe ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-08-18 18:27 ` Stephen Leake 2021-08-18 21:30 ` Yuan Fu @ 2021-08-23 6:51 ` Yuan Fu 2021-08-24 14:59 ` [SPAM UNSURE] " Stephen Leake 2021-08-24 22:51 ` Stefan Monnier 1 sibling, 2 replies; 370+ messages in thread From: Yuan Fu @ 2021-08-23 6:51 UTC (permalink / raw) To: Stephen Leake Cc: Eli Zaretskii, Theodor Thornhill, Stefan Monnier, Clément Pit-Claudel, emacs-devel > > ada-mode takes the approach of embedding the indent rules directly in > the grammar, and the functions that do that provide a few more options > than yours. To see the definition of those functions, you'll have to > install the wisi package, and look in wisi.info, section Grammar > actions. (it would be nice if that info/html file was linked from the > GNU ELPA package page; I'll start a new thread for that). I had a cursory look at the manual for indent in wisi and have some questions. Why does wisi indent from “low-level productions”? (I think most indentation engines work line-by-line from the first line.) I don’t know much about how wisi works, but the indentation system seems to stem from circumstances quite different from those of tree-sitter. For example, wisi’s indent is devised alongside the grammar definition, while for tree-sitter, all the hard work of defining the grammar is done for me and I’m merely a user of the grammar: that makes indenting with tree-sitter a much simpler job. A problem I have with smie (and maybe wisi, but I didn’t look into wisi) is its seeming complexity. I’m merely a 22-year-old who drank too much coca-cola, and smie is too complicated for my soaked brain to comprehend. Having had a traumatic experience trying to use smie[1], I want to make my indentation system as straightforward as possible. It doesn’t have to be complicated anyway, since it does so much less than wisi and smie. 
Right now I’d say it’s pretty simple, and most tasks (in indenting C) can be reasonably done, and I imagine difficult cases can be solved by writing custom matcher and anchor functions. Stefan, can you have a look at tree-sitter-simple-indent? It’s about two messages up. It generally follows the (pos . offset) idea but has some twists. [1] Of course, I need to define the grammar when using smie but not when using tree-sitter, so it’s like comparing apples to pears, but I can’t resist finally telling a joke on the list. Thanks, Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: [SPAM UNSURE] Re: Tree-sitter api 2021-08-23 6:51 ` Yuan Fu @ 2021-08-24 14:59 ` Stephen Leake 2021-08-27 5:18 ` [SPAM UNSURE] " Yuan Fu 2021-08-24 22:51 ` Stefan Monnier 1 sibling, 1 reply; 370+ messages in thread From: Stephen Leake @ 2021-08-24 14:59 UTC (permalink / raw) To: Yuan Fu Cc: Eli Zaretskii, Theodor Thornhill, Stefan Monnier, Clément Pit-Claudel, emacs-devel Yuan Fu <casouri@gmail.com> writes: >> >> ada-mode takes the approach of embedding the indent rules directly in >> the grammar, and the functions that do that provide a few more options >> than yours. To see the definition of those functions, you'll have to >> install the wisi package, and look in wisi.info, section Grammar >> actions. (it would be nice if that info/html file was linked from the >> GNU ELPA package page; I'll start a new thread for that). > > I had a cursory look at the manual for indent in wisi and have some > questions. Why does wisi indent from “low-level productions”? The indent of every new-line must be specified; low level productions can contain new-lines. > (I think most indentation engine works line-by-line from the first > line.) I don’t know much about how wisi works, but the indentation > system seems to stem from circumstances quite different from that of > tree-sitter. For example, wiki’s indent is devised alongside the > grammar definition, while for tree-sitter, all the hard work of > defining grammar is done for me and I’m merely a user of the grammar: > that makes indenting with tree-sitter a much simpler job. The Ada grammar is taken from the Ada Reference Manual; the indent information is added after. The indent information could be in a separate file, as in tree-sitter (wisitoken does not currently support this; there would need to be a way to specify which production the indent rule is associated with). A tree-sitter based indent engine still has to specify the indent of every new-line; it's the same amount of information. 
Taking the examples from your email: > ((match nil "function_definition" "body") parent 0) > means “match the node which it’s parent’s type is > “function_definition” and its field name is “body”, indent to the > start of its parent. That indents the starting braces in > int main () > { > } Referring to the tree-sitter-c grammar at https://github.com/tree-sitter/tree-sitter-c/blob/master/grammar.js, there is a C grammar production (in tree-sitter syntax): function_definition: $ => seq( optional($.ms_call_modifier), $._declaration_specifiers, field('declarator', $._declarator), field('body', $.compound_statement) ), In wisitoken syntax, this is: function_definition : [ms_call_modifier] declaration_specifiers declarator=declarator body=compound_statement (the current wisi user guide does not define the "=" syntax for declaring token names, but it is supported; I'll add it to the user guide) The indent rule specifies the indent of the field named 'body', relative to the start of the production. So in wisitoken, this would specify one component of the indent action for this production: {(wisi-indent-action [nil nil nil (body . 0)])} Presumably there are other rules that specify the indent of the other tokens in that production, so they would not be 'nil', which in wisitoken means "undefined"; it is an error for any new-line to have an undefined indent after all indent actions are applied. Next example: ((parent-is "call_expression") parent 2) The production is: call_expression: $ => prec(PREC.CALL, seq( field('function', $._expression), field('arguments', $.argument_list) )), In wisitoken syntax (note that wisitoken does not support precedence declarations (yet)): call_expression : function=expression arguments=argument_list {(wisi-indent-action [nil (arguments . 2)])} So your syntax for indent is much more verbose than the wisi syntax (because each token gets a separate rule), but specifies the same information.
Your syntax also requires naming each token that is referenced in an indent rule; wisitoken can use token position to do that, which is the main reason indent is specified directly in the grammar file; it's very easy to associate each indent expression with the corresponding token, without having to make up names for the tokens. Here are the above wisitoken productions without the token names: function_definition : [ms_call_modifier] declaration_specifiers declarator compound_statement {(wisi-indent-action [nil nil nil 0])} call_expression : expression argument_list {(wisi-indent-action [nil 2])} To be fair, we'd have to look at the other types of rules, to see if this pattern holds up. I think you were biased by the "matching" rules tree-sitter supports. That approach is reasonable when you only want to specify information for a few nodes in the tree. Wisi assumes you want to specify indent information for most of the nodes in the tree, so it supports a tree-traversal model instead. Tree-sitter does support tree traversal, but doesn't provide an easy way to add information for each node, as the wisi indent-action syntax does. -- -- Stephe ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: [SPAM UNSURE] Tree-sitter api 2021-08-24 14:59 ` [SPAM UNSURE] " Stephen Leake @ 2021-08-27 5:18 ` Yuan Fu 2021-08-31 0:48 ` Stephen Leake 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-08-27 5:18 UTC (permalink / raw) To: Stephen Leake Cc: Eli Zaretskii, Theodor Thornhill, Stefan Monnier, Clément Pit-Claudel, emacs-devel Thank you very much for spending time on this :-) > On Aug 24, 2021, at 7:59 AM, Stephen Leake <stephen_leake@stephe-leake.org> wrote: > > Yuan Fu <casouri@gmail.com> writes: > >>> >>> ada-mode takes the approach of embedding the indent rules directly in >>> the grammar, and the functions that do that provide a few more options >>> than yours. To see the definition of those functions, you'll have to >>> install the wisi package, and look in wisi.info, section Grammar >>> actions. (it would be nice if that info/html file was linked from the >>> GNU ELPA package page; I'll start a new thread for that). >> >> I had a cursory look at the manual for indent in wisi and have some >> questions. Why does wisi indent from “low-level productions”? > > The indent of every new-line must be specified; low level productions > can contain new-lines. Ah, I see, what I did is to find the “largest” node that starts at BOL, and try to match that. IIUC, wisi starts from the “smallest” entity, and goes up (by getting its parent repeatedly) until there is a non-nil indent rule for it? [snip] > So your syntax for indent is much more verbose than the wisi syntax > (because each token gets a separate rule), but specifies the same > information. > > Your syntax also requires naming each token that is referenced in an > indent rule; wisitoken can use token position to do that, which is the > main reason indent is specified directly in the grammar file; it's very > easy to associate each indent expression with the corresponding token, > without having to make up names for the tokens. 
> Here are the above > wisitoken productions without the token names: > > function_definition : [ms_call_modifier] declaration_specifiers > declarator compound_statement > {(wisi-indent-action [nil nil nil 0])} > > call_expression : expression argument_list > {(wisi-indent-action [nil 2])} > > To be fair, we'd have to look at the other types of rules, to see if > this pattern holds up. I tried and all rules can be translated into wisi’s style. However, it ends up as verbose as the previous one. My idea is to write out match patterns (similar to that in wisi) and give names to the interesting ones (so we use names as opposed to position). Then, if any matched node happens to be the node at point, use that node’s corresponding indent rule to indent. And in the indent rule, we can refer to other matched nodes. For example, in the indent rule of list_rest, the anchor is list_first. Maybe there are better ways to implement this, but at its current stage I don’t think this is better than tree-sitter-simple-indent. I think part of the reason why wisi’s indent rule can be succinct is that it is written along the grammar definition. It is hard to make tree-sitter’s indent rule as succinct while being easy to understand. (defvar tree-sitter-query-indent-rules '((tree-sitter-c "(function_definition body: (_) @body) (field_declaration_list) @field_decl (call_expression (_) @call_child) (if_statement (condition) @if_cond (consequence) @if_cons (alternative) @if_alt \"else\" @else) (switch_statement (condition) @switch_cond) (case_statement (_) @case-child) @case (compound_statement) @lbracket \"}\" @rbracket (compound_statement . (_) @list_first (_)* @list_rest) (initializer_list . (_) @list_first (_)* @list_rest) (argument_list . (_) @list_first (_)* @list_rest) (parameter_list . (_) @list_first (_)* @list_rest) (field_declaration_list . 
(_) @list_first (_)* @list_rest) " (body parent 0) (field_decl parent 0) (call_child parent 2) (if_cond parent 2) (if_cons parent 2) (if_alt parent 2) (switch_cond parent 2) (else parent 0) (case parent 0) (case-child parent 2) (lbracket parent 2) (rbracket parent 0) (list_first parent 2) (list_rest list_first 0))) "A list of indent rule settings. Each indent rule setting should be (LANGUAGE PATTERN INDENT INDENT...) where LANGUAGE is a language symbol, PATTERN is a query pattern string, and each INDENT is a list (CAPTURE_NAME ANCHOR OFFSET). If a captured node matches the node at point, Emacs looks for an INDENT that has a matching CAPTURE_NAME, and uses the ANCHOR and OFFSET of that INDENT to indent the current line. ANCHOR should be a capture name that captures another node in PATTERN. Emacs finds the column of that node, adds OFFSET to it, and indents the current line to that column. TODO: examples in manual") > > I think you were biased by the "matching" rules tree-sitter supports. > That approach is reasonable when you only want to specify information > for a few nodes in the tree. Wisi assumes you want to specify indent > information for most of the nodes in the tree, so it supports a > tree-traversal model instead. I assumed that the indent rule for most nodes would be something basic, like “same as previous line”, and we only need to specify indent rules for some “special” nodes. IIUC, this tree-traversal method that you mentioned is like going bottom-up, and (in tree-sitter terms) match on each level, and accumulate indent delta for each matched indent rule, is that right? Does wisi go all the way up to top-level? > Tree-sitter does support tree traversal, > but doesn't provide an easy way to add information for each node, as the > wisi indent-action syntax does. Yes, I would still need to use a match pattern and name each node that I want to specify an indent delta for.
There is no way to specify indent by position in the match pattern without naming each node. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: [SPAM UNSURE] Tree-sitter api 2021-08-27 5:18 ` [SPAM UNSURE] " Yuan Fu @ 2021-08-31 0:48 ` Stephen Leake 0 siblings, 0 replies; 370+ messages in thread From: Stephen Leake @ 2021-08-31 0:48 UTC (permalink / raw) To: Yuan Fu Cc: Eli Zaretskii, Theodor Thornhill, Stefan Monnier, Clément Pit-Claudel, emacs-devel Yuan Fu <casouri@gmail.com> writes: > Thank you very much for spending time on this :-) And thank you for the same; always helpful to have different points of view. >> The indent of every new-line must be specified; low level productions >> can contain new-lines. > > Ah, I see, what I did is to find the “largest” node that starts at > BOL, and try to match that. IIUC, wisi starts from the “smallest” > entity, and goes up (by getting its parent repeatedly) until there is > a non-nil indent rule for it? That's almost right. The indent rule for each production is applied while walking the entire syntax tree in depth-first order. >> To be fair, we'd have to look at the other types of rules, to see if >> this pattern holds up. > > I tried and all rules can be translated into wisi’s style. Ok. > However, it ends up as verbose as the previous one. My idea is to > write out match patterns (similar to that in wisi) and give names to > the interesting ones (so we use names as opposed to position). Then, > if any matched node happens to be the node at point, use that node’s > corresponding indent rule to indent. And in the indent rule, we can > refer to other matched nodes. For example, in the indent rule of > list_rest, the anchor is list_first. > > Maybe there are better ways to implement this, but at its current > stage I don’t think this is better than tree-sitter-simple-indent. Ok. > I think part of the reason why wisi’s indent rule can be succinct is > that it is written along the grammar definition. It is hard to make > tree-sitter’s indent rule as succinct while being easy to understand. Right. 
> IIUC, this tree-traversal method that you mentioned is like going > bottom-up, and (in tree-sitter terms) match on each level, and > accumulate indent delta for each matched indent rule, is that right? Yes. > Does wisi go all the way up to top-level? Yes; the top-level rule says the indent of every line defaults to 0; that covers any remaining 'nil' values. I have not tried to make this part of ada-mode incremental yet (ie, only visit changed nodes). I'm not sure that's possible. -- -- Stephe ^ permalink raw reply [flat|nested] 370+ messages in thread
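For readers following along, the traversal model described above (walk the whole tree depth-first, let each production add a delta to the lines that begin inside its tokens, then default whatever is still undefined to 0) can be sketched as a toy in C. This is only an illustration under stated assumptions, not wisi's implementation: `struct toy_node`, `toy_walk`, `toy_example_indent` and the delta values are all made up for the example, and the tree is written out by hand rather than produced by a parser.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define TOY_NIL INT_MIN      /* stands in for 'nil' (undefined indent) */
#define TOY_MAX_CHILDREN 4
#define TOY_LINES 4

/* Hypothetical node type; not wisi's or tree-sitter's data structure.  */
struct toy_node
{
  int first_line, last_line; /* lines spanned by the node, 0-based */
  int starts_line;           /* nonzero if its first token begins a line */
  int delta;                 /* delta from the parent's action, or TOY_NIL */
  size_t n_children;
  const struct toy_node *children[TOY_MAX_CHILDREN];
};

/* Depth-first walk: add NODE's delta to every line that begins inside
   NODE, then visit its children.  */
static void
toy_walk (const struct toy_node *node, int *indent, int n_lines)
{
  assert (node != NULL && indent != NULL);
  if (node->delta != TOY_NIL)
    {
      int line = node->starts_line ? node->first_line : node->first_line + 1;
      for (; line <= node->last_line && line < n_lines; line++)
        indent[line] = (indent[line] == TOY_NIL ? 0 : indent[line]) + node->delta;
    }
  for (size_t i = 0; i < node->n_children; i++)
    toy_walk (node->children[i], indent, n_lines);
}

/* Indent of LINE in the four-line example

     int main ()
     {
     return 0;
     }

   with made-up deltas: body 0, "{" 0, statement 2, "}" 0.  */
int
toy_example_indent (int line)
{
  static const struct toy_node declarator = { 0, 0, 0, TOY_NIL, 0, { NULL } };
  static const struct toy_node lbrace = { 1, 1, 1, 0, 0, { NULL } };
  static const struct toy_node stmt = { 2, 2, 1, 2, 0, { NULL } };
  static const struct toy_node rbrace = { 3, 3, 1, 0, 0, { NULL } };
  static const struct toy_node body
    = { 1, 3, 1, 0, 3, { &lbrace, &stmt, &rbrace } };
  static const struct toy_node function_definition
    = { 0, 3, 1, TOY_NIL, 2, { &declarator, &body } };
  int indent[TOY_LINES];

  assert (line >= 0 && line < TOY_LINES);
  for (int i = 0; i < TOY_LINES; i++)
    indent[i] = TOY_NIL;
  toy_walk (&function_definition, indent, TOY_LINES);
  /* Top-level rule: anything still undefined defaults to 0.  */
  return indent[line] == TOY_NIL ? 0 : indent[line];
}
```

Since the deltas are simply summed, the visiting order does not change the result here; the depth-first order only matters for how the walk is organized.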
* Re: Tree-sitter api 2021-08-23 6:51 ` Yuan Fu 2021-08-24 14:59 ` [SPAM UNSURE] " Stephen Leake @ 2021-08-24 22:51 ` Stefan Monnier 1 sibling, 0 replies; 370+ messages in thread From: Stefan Monnier @ 2021-08-24 22:51 UTC (permalink / raw) To: Yuan Fu Cc: Stephen Leake, Eli Zaretskii, Theodor Thornhill, Clément Pit-Claudel, emacs-devel > (I think most indentation engine works line-by-line from the > first line.) FWIW, the vast majority of the code performing indentation in the various major modes in Emacs does it by parsing backward from the position of point and doesn't work "line by line". The "line by line" is only used for `indent-region` but the workhorse function is in `indent-line-function` and only performs indentation of a single line without touching anything else. IOW, `indent-region` will usually go "line-by-line" but for each line the actual work will be by parsing backward from that line (i.e. re-parsing the previous lines that had just been parsed for the previous line's indentation). This is obviously not ideal in terms of efficiency, but in practice indenting a single line usually only needs to parse a small number of lines (I suspect it's almost O(1) of *amortized* complexity so in most cases the algorithmic complexity of `indent-region` is not really affected). > Stefan, can you have a look at tree-sitter-simple-indent? It’s like two > messages up? It goes generally along the (pos . offset) idea but has > some twists. It's in my todo list, yes. I'm still backlog'd, tho. Stefan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-08-17 6:18 ` Yuan Fu 2021-08-18 18:27 ` Stephen Leake @ 2021-08-22 2:43 ` Yuan Fu 2021-08-22 3:46 ` Yuan Fu 2021-08-22 6:15 ` Eli Zaretskii 2021-08-25 0:21 ` Stefan Monnier 2 siblings, 2 replies; 370+ messages in thread From: Yuan Fu @ 2021-08-22 2:43 UTC (permalink / raw) To: Stefan Monnier Cc: Stephen Leake, Eli Zaretskii, Theodor Thornhill, Clément Pit-Claudel, emacs-devel I’m trying to automate building the dynamic modules for each language definition. The files are largely identical for each language, I just need to replace language-specific names in each file. Can I use awk? Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-08-22 2:43 ` Yuan Fu @ 2021-08-22 3:46 ` Yuan Fu 2021-08-22 6:16 ` Eli Zaretskii 2021-08-22 6:15 ` Eli Zaretskii 1 sibling, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-08-22 3:46 UTC (permalink / raw) To: Stefan Monnier Cc: Stephen Leake, Eli Zaretskii, Theodor Thornhill, Clément Pit-Claudel, emacs-devel > On Aug 21, 2021, at 7:43 PM, Yuan Fu <casouri@gmail.com> wrote: > > I’m trying to automate building the dynamic modules for each language definition. The files are largely identical for each language, I just need to replace language-specific names in each file. Can I use awk? Actually, after searching for a bit more, I think what I need is sed. Or there are better tools that I don’t know about? Maybe I can just use emacs --batch? Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-08-22 3:46 ` Yuan Fu @ 2021-08-22 6:16 ` Eli Zaretskii 0 siblings, 0 replies; 370+ messages in thread From: Eli Zaretskii @ 2021-08-22 6:16 UTC (permalink / raw) To: Yuan Fu; +Cc: stephen_leake, cpitclaudel, theo, monnier, emacs-devel > From: Yuan Fu <casouri@gmail.com> > Date: Sat, 21 Aug 2021 20:46:45 -0700 > Cc: Theodor Thornhill <theo@thornhill.no>, > Eli Zaretskii <eliz@gnu.org>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > Stephen Leake <stephen_leake@stephe-leake.org>, > emacs-devel <emacs-devel@gnu.org> > > Actually, after searching for a bit more, I think what I need is sed. Or there are better tools that I don’t know about? Maybe I can just use emacs --batch? Both are possible, but if what you need to do must be done as part of bootstrap, Emacs might not be available yet. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-08-22 2:43 ` Yuan Fu 2021-08-22 3:46 ` Yuan Fu @ 2021-08-22 6:15 ` Eli Zaretskii 1 sibling, 0 replies; 370+ messages in thread From: Eli Zaretskii @ 2021-08-22 6:15 UTC (permalink / raw) To: Yuan Fu; +Cc: stephen_leake, cpitclaudel, theo, monnier, emacs-devel > From: Yuan Fu <casouri@gmail.com> > Date: Sat, 21 Aug 2021 19:43:31 -0700 > Cc: Theodor Thornhill <theo@thornhill.no>, > Eli Zaretskii <eliz@gnu.org>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > Stephen Leake <stephen_leake@stephe-leake.org>, > emacs-devel <emacs-devel@gnu.org> > > I’m trying to automate building the dynamic modules for each language definition. The files are largely identical for each language, I just need to replace language-specific names in each file. Can I use awk? Yes. We already use Awk in a couple of places in the build process. Another possibility is to use Emacs, if what you need to do is not part of bootstrap. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-08-17 6:18 ` Yuan Fu 2021-08-18 18:27 ` Stephen Leake 2021-08-22 2:43 ` Yuan Fu @ 2021-08-25 0:21 ` Stefan Monnier 2021-08-27 5:45 ` Yuan Fu 2 siblings, 1 reply; 370+ messages in thread From: Stefan Monnier @ 2021-08-25 0:21 UTC (permalink / raw) To: Yuan Fu Cc: Theodor Thornhill, Eli Zaretskii, Clément Pit-Claudel, Stephen Leake, emacs-devel > Okay, here is the (ad-hoc) infrastructure I came up with: It's more than what I proposed, but it looks fairly good. See patch below which is the "side effect" of reading your code. You'll see that I removed the "-function" from the function name (this suffix is used for variables holding functions rather than for the function themselves) and I split that function into two, the outer one (tree-sitter-indent) implementing basically what I suggested and the inner one (tree-sitter-simple-indent) implementing the extra structure you added to it, mediated by a new var `tree-sitter-indent-function` which modes can set if they want to use another algorithm than the one you implemented in `tree-sitter-simple-indent`. The reason why I divided it this way is that my experience with indentation code is that it can be useful occasionally to call recursively the indentation code to know where a node *would* be indented. This comes in handy when you want to be able to provide indentation styles like: let myvariable = if (foo) { bar } else { baz } where the body of the `if` branches needs to be indented relative to the position where the `if` itself would be indented if it were on its own line. 
Stefan PS: The patch also adds some space before open-paren-in-column-0-in-strings to circumvent some problems with outline-minor-mode incorrectly thinking those open-parens correspond to actual top-level definitions :-( diff --git a/lisp/tree-sitter.el b/lisp/tree-sitter.el index 83aa2d0d123..2c5d103c42d 100644 --- a/lisp/tree-sitter.el +++ b/lisp/tree-sitter.el @@ -52,6 +52,8 @@ tree-sitter-should-enable-p ;;; Parser API supplement +(defvar tree-sitter-parser-list) + (defun tree-sitter-get-parser (language) "Find the first parser using LANGUAGE in `tree-sitter-parser-list'." (catch 'found @@ -196,7 +198,7 @@ tree-sitter-simple-indent-rules "A list of indent rule settings. Each indent rule setting should be (LANGUAGE . RULES), where LANGUAGE is a language symbol, and RULES is a list of -(MATCHER ANCHOR OFFSET). + (MATCHER ANCHOR OFFSET). MATCHER determines whether this rule applies, ANCHOR and OFFSET together determines which column to indent to. @@ -289,7 +291,7 @@ tree-sitter-simple-indent-presets MATCHER: -(match NODE-TYPE PARENT-TYPE NODE-FIELD NODE-INDEX-MIN NODE-INDEX-MAX) + (match NODE-TYPE PARENT-TYPE NODE-FIELD NODE-INDEX-MIN NODE-INDEX-MAX) NODE-TYPE checks for node's type, PARENT-TYPE check for parent's type, NODE-FIELD checks for the filed name of node @@ -304,25 +306,25 @@ tree-sitter-simple-indent-presets that starts at point. This is the case when indenting an empty line. -(node-at-point TYPE NAMED) + (node-at-point TYPE NAMED) Check that the node at point -- not the largest node at point, has type TYPE. If NAMED non-nil, check the named node at point. -(parent-is TYPE) + (parent-is TYPE) Check that the parent has type TYPE. -(node-is TYPE) + (node-is TYPE) Checks that the node has type TYPE. -(parent-match PATTERN) + (parent-match PATTERN) Checks that the parent matches PATTERN, a query pattern. -(node-match PATTERN) + (node-match PATTERN) Checks that the node matches PATTERN, a query pattern. 
@@ -356,7 +358,7 @@ tree-sitter--simple-apply If FN is a key in `tree-sitter-simple-indent-presets', use the corresponding value as the function." - (cond ((consp fn) + (cond ((consp fn) ;FIXME: This will mis-match for non-compiled lambdas! (apply (tree-sitter--simple-apply (car fn) (cdr fn)) args)) ((and (symbolp fn) @@ -366,21 +368,46 @@ tree-sitter--simple-apply ((functionp fn) (apply fn args)) (t (error "Couldn't find appropriate function for FN")))) -(defun tree-sitter-simple-indent-function () +(defvar tree-sitter-indent-function #'tree-sitter-simple-indent + "Document.") + +(defun tree-sitter-indent () "Indent according to `tree-sitter-simple-indent-rules'." - (let* ((orig-pos (point)) - (bol (save-excursion + (pcase-let* + ((orig-pos (point)) + (bol (save-excursion + (beginning-of-line) + (skip-chars-forward " \t") + (point))) + (node (tree-sitter-parent-while + (cl-loop for parser in tree-sitter-parser-list + for node = (tree-sitter-node-at + bol nil parser) + if node return node) + (lambda (node) + (eq bol (tree-sitter-node-start node))))) + (parent (tree-sitter-node-parent node)) + (`(,anchor . ,offset) + (funcall tree-sitter-indent-function node parent))) + (let ((col (+ (save-excursion + (goto-char anchor) + (current-column)) + offset))) + (if (< bol orig-pos) + (save-excursion + (indent-line-to col)) + (indent-line-to col)) + (when tree-sitter--indent-verbose + (message "indent to %S (%S position + %S)" + col anchor offset))))) + +(defun tree-sitter-simple-indent (node parent) + (let* ((bol (save-excursion (beginning-of-line) (skip-chars-forward " \t") (point))) - (node (tree-sitter-parent-while - (cl-loop for parser in tree-sitter-parser-list - for node = (tree-sitter-node-at - bol nil parser) - if node return node) - (lambda (node) - (eq bol (tree-sitter-node-start node))))) - (parent (tree-sitter-node-parent node)) + ;; FIXME: Can't we get the language from `node' rather than + ;; from `point'? 
(language (tree-sitter-language-at (point))) (rules (alist-get language tree-sitter-simple-indent-rules))) (cl-loop for rule in rules @@ -388,20 +415,9 @@ tree-sitter-simple-indent-function for anchor = (nth 1 rule) for offset = (nth 2 rule) if (tree-sitter--simple-apply pred (list node parent bol)) - do (let ((col (+ (save-excursion - (goto-char - (tree-sitter--simple-apply - anchor (list node parent bol))) - (current-column)) - offset))) - (if (< bol orig-pos) - (save-excursion - (indent-line-to col)) - (indent-line-to col)) - (when tree-sitter--indent-verbose - (message "matched %S\nindent to %s" - pred col))) - and return nil))) + do `(,(tree-sitter--simple-apply + anchor (list node parent bol)) + . ,offset)))) ;;; Lab @@ -435,7 +451,7 @@ ts-c-mode (ignore t nil nil nil) indent-line-function - #'tree-sitter-simple-indent-function + #'tree-sitter-indent tree-sitter-simple-indent-rules ts-c-tree-sitter-indent-rules) ^ permalink raw reply related [flat|nested] 370+ messages in thread
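The split in the patch above means the mode-specific function only returns an (anchor . offset) pair, and the outer function turns that pair into a column by taking the column of the anchor position and adding the offset. A minimal sketch of that last step, in C rather than Lisp so it can be compiled standalone; `toy_column_at` and `toy_target_column` are hypothetical names, and the sketch assumes one column per byte (no tabs, no multibyte characters), which `current-column` in Emacs does not assume.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Column of byte position POS in TEXT, counted from the preceding
   newline.  Assumes one column per byte.  */
static int
toy_column_at (const char *text, size_t pos)
{
  assert (text != NULL && pos <= strlen (text));
  int col = 0;
  while (pos > 0 && text[pos - 1] != '\n')
    {
      pos--;
      col++;
    }
  return col;
}

/* Target column for an (ANCHOR . OFFSET) pair: the anchor's column
   plus the offset.  */
int
toy_target_column (const char *text, size_t anchor, int offset)
{
  return toy_column_at (text, anchor) + offset;
}
```

For example, with the text `"  if (foo) {\nbar\n"`, an anchor at position 2 (the `i` of `if`) and an offset of 2 give column 4 for the line being indented.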
* Re: Tree-sitter api 2021-08-25 0:21 ` Stefan Monnier @ 2021-08-27 5:45 ` Yuan Fu 2021-09-03 19:16 ` Theodor Thornhill 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-08-27 5:45 UTC (permalink / raw) To: Stefan Monnier Cc: Stephen Leake, Eli Zaretskii, Theodor Thornhill, Clément Pit-Claudel, emacs-devel > On Aug 24, 2021, at 5:21 PM, Stefan Monnier <monnier@iro.umontreal.ca> wrote: > >> Okay, here is the (ad-hoc) infrastructure I came up with: > > It's more than what I proposed, but it looks fairly good. > See patch below which is the "side effect" of reading your code. > > You'll see that I removed the "-function" from the function name (this > suffix is used for variables holding functions rather than for the > function themselves) and I split that function into two, the outer one > (tree-sitter-indent) implementing basically what I suggested and the > inner one (tree-sitter-simple-indent) implementing the extra structure > you added to it, mediated by a new var `tree-sitter-indent-function` > which modes can set if they want to use another algorithm than the one > you implemented in `tree-sitter-simple-indent`. > > The reason why I divided it this way is that my experience with > indentation code is that it can be useful occasionally to call > recursively the indentation code to know where a node *would* be > indented. This comes in handy when you want to be able to provide > indentation styles like: > > let myvariable = if (foo) { > bar > } else { > baz > } > > where the body of the `if` branches needs to be indented relative to the > position where the `if` itself would be indented if it were on its own line. Thanks, Stefan :-) I applied your patch and fixed the two FIXME’s. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-08-27 5:45 ` Yuan Fu @ 2021-09-03 19:16 ` Theodor Thornhill [not found] ` <AF64EB2C-CCEC-4C98-8FE3-37697BEC9098@gmail.com> 0 siblings, 1 reply; 370+ messages in thread From: Theodor Thornhill @ 2021-09-03 19:16 UTC (permalink / raw) To: Yuan Fu, Stefan Monnier Cc: Eli Zaretskii, Clément Pit-Claudel, Stephen Leake, emacs-devel Yuan Fu <casouri@gmail.com> writes: Hi! > > Thanks, Stefan :-) I applied your patch and fixed the two FIXME’s. > If I were to start experimenting with this in csharp-mode, how would I start? Right now we support the rust version on melpa, but I'd rather move to this core-supported package. How far are we from including this in core, and what can I do to help? All the best, Theodor Thornhill ^ permalink raw reply [flat|nested] 370+ messages in thread
[parent not found: <AF64EB2C-CCEC-4C98-8FE3-37697BEC9098@gmail.com>]
* Re: Tree-sitter api [not found] ` <AF64EB2C-CCEC-4C98-8FE3-37697BEC9098@gmail.com> @ 2021-09-04 12:49 ` Tuấn-Anh Nguyễn 2021-09-04 13:04 ` Eli Zaretskii 2021-09-04 15:31 ` Yuan Fu 2021-09-04 15:14 ` Tuấn-Anh Nguyễn 2021-09-05 21:15 ` Theodor Thornhill 2 siblings, 2 replies; 370+ messages in thread From: Tuấn-Anh Nguyễn @ 2021-09-04 12:49 UTC (permalink / raw) To: Yuan Fu Cc: Clément Pit-Claudel, Theodor Thornhill, emacs-devel, Stefan Monnier, Eli Zaretskii, Stephen Leake On Sat, Sep 4, 2021 at 1:44 PM Yuan Fu <casouri@gmail.com> wrote: > 1) tree-sitter lacks a way to change its malloc behavior in run-time, I commented on their road-map issue, but no one has replied yet, > Do you mean APIs to change its alloc/free functions at run time? Why would we need to do that? Doesn't simply defining `ts_malloc` and related functions work? See https://github.com/tree-sitter/tree-sitter/blob/v0.20.0/lib/src/alloc.h#L27. > 4) I need to work on a better way to build and distribute language dynamic modules. > I think there should be 2 mechanisms: 1. The binaries for common platforms should be built on Emacs's build infrastructure, and distributed through GNU ELPA. 2. There should be Lisp functions to download the grammar sources and compile them (by invoking the compiler). > You can find a script for building dynamic modules at https://github.com/casouri/tree-sitter-module > I may be missing something here, but the grammars' compiled forms don't need to be Emacs dynamic modules, right? They only need to be dynamically-loadable shared libraries. -- Tuấn-Anh Nguyễn Software Engineer ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-04 12:49 ` Tuấn-Anh Nguyễn @ 2021-09-04 13:04 ` Eli Zaretskii 2021-09-04 14:49 ` Tuấn-Anh Nguyễn 2021-09-04 15:31 ` Yuan Fu 1 sibling, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-09-04 13:04 UTC (permalink / raw) To: Tuấn-Anh Nguyễn Cc: casouri, theo, cpitclaudel, emacs-devel, monnier, stephen_leake > From: Tuấn-Anh Nguyễn <ubolonton@gmail.com> > Date: Sat, 4 Sep 2021 19:49:35 +0700 > Cc: Theodor Thornhill <theo@thornhill.no>, Stephen Leake <stephen_leake@stephe-leake.org>, > Eli Zaretskii <eliz@gnu.org>, Clément Pit-Claudel <cpitclaudel@gmail.com>, > Stefan Monnier <monnier@iro.umontreal.ca>, emacs-devel <emacs-devel@gnu.org> > > On Sat, Sep 4, 2021 at 1:44 PM Yuan Fu <casouri@gmail.com> wrote: > > 1) tree-sitter lacks a way to change its malloc behavior in run-time, I commented on their road-map issue, but no one has replied yet, > > > Do you mean APIs to change its alloc/free functions at run time? Why would we > need to do that? Because what TS does when it runs out of memory is call 'exit'. That's unacceptable for Emacs. Emacs can handle out-of-memory situations well enough, but it can only do that if the problem is reported to it by the memory-allocation functions. > Doesn't simply defining `ts_malloc` and related functions work? No, because we want to be able to link against a TS library, we don't want to require people who build Emacs to build TS as well. ^ permalink raw reply [flat|nested] 370+ messages in thread
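To make the request concrete: what is being asked for is a way for the host program to install its own allocation function at run time, so that an allocation failure is reported back to the host instead of ending in 'exit'. The following is a generic sketch of that pattern only. None of these names exist in tree-sitter or Emacs; `toy_set_malloc`, `toy_lib_alloc`, `toy_host_malloc` and the artificial `TOY_HOST_LIMIT` are invented for the illustration, and the limit is there only to simulate memory exhaustion deterministically.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef void *(*toy_malloc_fn) (size_t);

static toy_malloc_fn toy_malloc_hook = malloc; /* library default */
static int toy_oom_seen;                       /* set by the host hook */

/* Library side: let the host replace the allocation function at run
   time.  Passing NULL restores the default.  */
void
toy_set_malloc (toy_malloc_fn fn)
{
  toy_malloc_hook = fn != NULL ? fn : malloc;
}

/* Library side: every allocation goes through the hook, and a failure
   is passed back to the caller instead of terminating the process.  */
void *
toy_lib_alloc (size_t size)
{
  assert (size > 0);
  return toy_malloc_hook (size);
}

#define TOY_HOST_LIMIT ((size_t) 1 << 20)

/* Host side: refuse requests above an artificial limit to simulate
   running out of memory, and record the failure so the host can react.  */
static void *
toy_host_malloc (size_t size)
{
  void *p = size > TOY_HOST_LIMIT ? NULL : malloc (size);
  if (p == NULL)
    toy_oom_seen = 1;
  return p;
}

/* Returns 1 if an oversized request fails cleanly and is reported.  */
int
toy_demo_oom_is_reported (void)
{
  toy_oom_seen = 0;
  toy_set_malloc (toy_host_malloc);
  void *small = toy_lib_alloc (16);
  void *huge = toy_lib_alloc (TOY_HOST_LIMIT + 1);
  int ok = small != NULL && huge == NULL && toy_oom_seen == 1;
  free (small);
  toy_set_malloc (NULL);
  return ok;
}
```

The point of doing this through a run-time hook rather than by redefining the allocation functions at build time is the one made above: it works against an already built shared library.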
* Re: Tree-sitter api 2021-09-04 13:04 ` Eli Zaretskii @ 2021-09-04 14:49 ` Tuấn-Anh Nguyễn 2021-09-04 15:00 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Tuấn-Anh Nguyễn @ 2021-09-04 14:49 UTC (permalink / raw) To: Eli Zaretskii Cc: Yuan Fu, Theodor Thornhill, Clément Pit-Claudel, emacs-devel, Stefan Monnier, Stephen Leake On Sat, Sep 4, 2021 at 8:04 PM Eli Zaretskii <eliz@gnu.org> wrote: > No, because we want to be able to link against a TS library, we don't > want to require people who build Emacs to build TS as well. Related questions: 1. Who do we expect to build the TS library? For Linux I assume that would be the maintainer of the (system) package `libtree-sitter`. Is that correct? 2. Who do we expect to build the grammar binaries? -- Tuấn-Anh Nguyễn Software Engineer ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-04 14:49 ` Tuấn-Anh Nguyễn @ 2021-09-04 15:00 ` Eli Zaretskii 2021-09-05 16:34 ` Tuấn-Anh Nguyễn 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-09-04 15:00 UTC (permalink / raw) To: Tuấn-Anh Nguyễn Cc: casouri, theo, cpitclaudel, emacs-devel, monnier, stephen_leake > From: Tuấn-Anh Nguyễn <ubolonton@gmail.com> > Date: Sat, 4 Sep 2021 21:49:29 +0700 > Cc: Yuan Fu <casouri@gmail.com>, Theodor Thornhill <theo@thornhill.no>, > Stephen Leake <stephen_leake@stephe-leake.org>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > Stefan Monnier <monnier@iro.umontreal.ca>, emacs-devel <emacs-devel@gnu.org> > > On Sat, Sep 4, 2021 at 8:04 PM Eli Zaretskii <eliz@gnu.org> wrote: > > No, because we want to be able to link against a TS library, we don't > > want to require people who build Emacs to build TS as well. > > Related questions: > 1. Who do we expect to build the TS library? For Linux I assume that would be > the maintainer of the (system) package `libtree-sitter`. Is that correct? The distro, I'd say. It can also be built on the user's machine and installed separately. Basically, the same as with any other optional library we use: libpng, HarfBuzz, etc. > 2. Who do we expect to build the grammar binaries? The ones that TS already provides? They are already built, no? Or what do you mean by "build the grammar binaries", what kind of binaries are those? Forgive me my ignorance. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-04 15:00 ` Eli Zaretskii @ 2021-09-05 16:34 ` Tuấn-Anh Nguyễn 2021-09-05 16:45 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Tuấn-Anh Nguyễn @ 2021-09-05 16:34 UTC (permalink / raw) To: Eli Zaretskii Cc: Yuan Fu, Theodor Thornhill, Clément Pit-Claudel, emacs-devel, Stefan Monnier, Stephen Leake On Sat, Sep 4, 2021 at 10:00 PM Eli Zaretskii <eliz@gnu.org> wrote: > > 2. Who do we expect to build the grammar binaries? > > The ones that TS already provides? They are already built, no? Or > what do you mean by "build the grammar binaries", what kind of > binaries are those? Forgive me my ignorance. There are 2 components: TS the library (`libtree-sitter`) provides the generic parts, not the grammars. The grammars come from various repositories, in source form. (Some of them are owned by the tree-sitter project, some are not.) Each of those consists of a generated `parser.c` and an optional `scanner.{c,cc}`. Each provides a function such as `const TSLanguage *tree_sitter_c (void)`, which returns the details on how to parse a specific language (e.g. the parse table). They are usually compiled into dynamically-loadable shared libraries (by a `tree-sitter` CLI program), and distributed separately from `libtree-sitter`. Tree-sitter has its own ABI versioning for these 2 components. It's easier to ensure ABI compatibility if they are both built by the same system. That's the case for GitHub's internal uses of tree-sitter. That's also the case with `tree-sitter` and `tree-sitter-langs` packages on MELPA. That's not the case with NeoVim's tree-sitter integration, and is a source of constant headaches AFAICT. If we leave `libtree-sitter` to the distro, then it also makes sense for the distro to provide the `tree-sitter` CLI program, and/or the grammar binaries/sources. -- Tuấn-Anh Nguyễn Software Engineer ^ permalink raw reply [flat|nested] 370+ messages in thread
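The ABI concern above boils down to a range check: a grammar binary carries the version it was generated for, and the library accepts it only if that version falls between the oldest one it still supports and the one it was built with. A minimal sketch of that check follows; the two constants are made-up numbers for illustration (the real values live in tree-sitter's `api.h`), and `toy_grammar_abi_compatible` is a hypothetical helper, not a tree-sitter function.

```c
#include <assert.h>

/* Made-up numbers for illustration only.  */
#define TOY_LANGUAGE_VERSION 13
#define TOY_MIN_COMPATIBLE_LANGUAGE_VERSION 12

/* Nonzero if a grammar generated for GRAMMAR_VERSION can be used by a
   library built with the two versions above.  */
int
toy_grammar_abi_compatible (unsigned grammar_version)
{
  assert (TOY_MIN_COMPATIBLE_LANGUAGE_VERSION <= TOY_LANGUAGE_VERSION);
  return grammar_version >= TOY_MIN_COMPATIBLE_LANGUAGE_VERSION
         && grammar_version <= TOY_LANGUAGE_VERSION;
}
```

When the library and the grammars are built by the same system, both sides come from the same tree-sitter release and the check passes by construction; when they come from different sources, a grammar can land on either side of the range.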
* Re: Tree-sitter api 2021-09-05 16:34 ` Tuấn-Anh Nguyễn @ 2021-09-05 16:45 ` Eli Zaretskii 0 siblings, 0 replies; 370+ messages in thread From: Eli Zaretskii @ 2021-09-05 16:45 UTC (permalink / raw) To: Tuấn-Anh Nguyễn Cc: casouri, theo, cpitclaudel, emacs-devel, monnier, stephen_leake > From: Tuấn-Anh Nguyễn <ubolonton@gmail.com> > Date: Sun, 5 Sep 2021 23:34:59 +0700 > Cc: Yuan Fu <casouri@gmail.com>, Theodor Thornhill <theo@thornhill.no>, > Stephen Leake <stephen_leake@stephe-leake.org>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > Stefan Monnier <monnier@iro.umontreal.ca>, emacs-devel <emacs-devel@gnu.org> > > If we leave `libtree-sitter` to the distro, then it also makes sense for the > distro to provide the `tree-sitter` CLI program, and/or the grammar > binaries/sources. Yes, of course. And users who built TS themselves, will have to build those grammar files as well. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-04 12:49 ` Tuấn-Anh Nguyễn 2021-09-04 13:04 ` Eli Zaretskii @ 2021-09-04 15:31 ` Yuan Fu 2021-09-05 16:45 ` Tuấn-Anh Nguyễn 1 sibling, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-09-04 15:31 UTC (permalink / raw) To: Tuấn-Anh Nguyễn Cc: Clément Pit-Claudel, Theodor Thornhill, emacs-devel, Stefan Monnier, Eli Zaretskii, Stephen Leake > >> You can find a script for building dynamic modules at https://github.com/casouri/tree-sitter-module >> > I may be missing something here, but the grammars' compiled forms don't need to > be Emacs dynamic modules, right? They only need to be dynamically-loadable > shared libraries. I packaged language definitions into dynamic modules: the system is there, why not take advantage of it? Do you think this approach can be improved in some way? Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-04 15:31 ` Yuan Fu @ 2021-09-05 16:45 ` Tuấn-Anh Nguyễn 2021-09-05 20:19 ` Yuan Fu 0 siblings, 1 reply; 370+ messages in thread From: Tuấn-Anh Nguyễn @ 2021-09-05 16:45 UTC (permalink / raw) To: Yuan Fu Cc: Clément Pit-Claudel, Theodor Thornhill, emacs-devel, Stefan Monnier, Eli Zaretskii, Stephen Leake On Sat, Sep 4, 2021 at 10:31 PM Yuan Fu <casouri@gmail.com> wrote: > I packaged language definitions into dynamic modules: the system is there, why not take advantage of it? Do you think this approach can be improved in some way? The language definitions just need to come from dynamically-loadable shared libraries. They don't have to be Emacs dynamic modules, which bring additional unnecessary complications, e.g. build difficulty, load path pollution, or inability to load grammar binaries from other sources like distro's package repos. It's better to just load the shared libs directly without going through module machinery. Use the functions in `dynlib.h`. -- Tuấn-Anh Nguyễn Software Engineer ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-05 16:45 ` Tuấn-Anh Nguyễn @ 2021-09-05 20:19 ` Yuan Fu 2021-09-06 0:03 ` Tuấn-Anh Nguyễn 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-09-05 20:19 UTC (permalink / raw) To: Tuấn-Anh Nguyễn Cc: Clément Pit-Claudel, Theodor Thornhill, emacs-devel, Stefan Monnier, Eli Zaretskii, Stephen Leake > On Sep 5, 2021, at 9:45 AM, Tuấn-Anh Nguyễn <ubolonton@gmail.com> wrote: > > On Sat, Sep 4, 2021 at 10:31 PM Yuan Fu <casouri@gmail.com> wrote: >> I packaged language definitions into dynamic modules: the system is there, why not take advantage of it? Do you think this approach can be improved in some way? > > The language definitions just need to come from dynamically-loadable shared > libraries. They don't have to be Emacs dynamic modules, which bring additional > unnecessary complications, e.g. build difficulty, load path pollution, or > inability to load grammar binaries from other sources like distro's package > repos. It's better to just load the shared libs directly without going through > module machinery. Use the functions in `dynlib.h`. > Dynamic modules come with nice things: for example, Emacs looks for them automatically in load-path; Emacs reports errors when it has a problem loading one; I can package some additional information with the module; I could maybe distribute them through the ordinary package.el facility, etc. If I load the shared library directly, I need to reinvent the wheels for loading, error reporting, searching in load-path, and others. On the other hand, dynamic modules don’t come with many complications. Yes, you need the additional emacs-module.h and tree-sitter-<lang>.c to build one, but that’s about it. And I was hoping to distribute pre-built modules anyway, so if all goes well, ordinary users won’t need to compile the modules. WDYT? P.S. what do you mean by “load path pollution”? P.P.S.
My impression is that other applications distribute language definitions by themselves, and it is not common for distros to package language definitions, is that correct? Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-05 20:19 ` Yuan Fu @ 2021-09-06 0:03 ` Tuấn-Anh Nguyễn 2021-09-06 0:23 ` Yuan Fu 0 siblings, 1 reply; 370+ messages in thread From: Tuấn-Anh Nguyễn @ 2021-09-06 0:03 UTC (permalink / raw) To: Yuan Fu Cc: Clément Pit-Claudel, Theodor Thornhill, emacs-devel, Stefan Monnier, Eli Zaretskii, Stephen Leake On Mon, Sep 6, 2021 at 3:19 AM Yuan Fu <casouri@gmail.com> wrote: > Dynamic modules comes with nice things, for example Emacs looks for them automatically in load-path; Emacs reports errors with it has problem loading one; On the other hand, dynamic modules don’t come with much complications. Yes, you need additional emacs-modules.h and tree-sitter-<lang>.c to build it, but that’s about it. See my other discussion with Eli. We want to rely on the distro to provide the binaries and the `tree-sitter` CLI program, and to be able to use shared libs from other sources as well (like self-built). They are not going to be Emacs dynamic modules. > I can package some additional information with the module; I could maybe distribute them through ordinary package.el facility, etc etc. Neither of these requires it to be a module at all. (Also note that package.el isn't able to handle platform-specific files at the moment.) > If I load the shared library directly, I need to reinvent the wheels for loading, error reporting, searching in load-path, and others. The non-module-specific part of loading is provided by `dynlib.h`. There's no wheel to reinvent here. What error reporting do you mean? (You are going to need additional checks for ABI compatibility anyway.) Searching a load path (not the `load-path`) is not that complicated. What are the others? > And I was hoping to distribute pre-built modules anyway, so if all went well, ordinary users don’t need to compile the modules. WDYT? It's good to provide that convenience, but it should not be at the expense of not being able to use binaries from other sources, or to build the binaries on their own. 
The `tree-sitter-langs` package already enables both of these. It provides both pre-built binaries and functions for users to compile on their own. And it does so without putting language definitions in dynamic modules. > P.S. what do you mean by “load path pollution”? I meant to say load path collision, but since you use `tree-sitter-{lang}` for the module name, that's less of a problem. Load path pollution is these names showing up when the user enumerates entries on the load path trying to go to the source of a Lisp library. That's annoying, but bearable. > P.P.S. My impression is that other applications distribute language definitions by themselves, and it is not common for distros to package language definitions, is that correct? I don't understand this. Can you rephrase it? All in all, you are severely underestimating the amount of complexity and wheels you will have to reinvent in other places compared to the amount of code you don't have to write by requiring language definitions to be in dynamic modules. (It's less than 100 lines, most of which are docstrings and comments.) -- Tuấn-Anh Nguyễn Software Engineer ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-06 0:03 ` Tuấn-Anh Nguyễn @ 2021-09-06 0:23 ` Yuan Fu 2021-09-06 5:33 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-09-06 0:23 UTC (permalink / raw) To: Tuấn-Anh Nguyễn Cc: Clément Pit-Claudel, Theodor Thornhill, emacs-devel, Stefan Monnier, Eli Zaretskii, Stephen Leake > On Sep 5, 2021, at 5:03 PM, Tuấn-Anh Nguyễn <ubolonton@gmail.com> wrote: > > On Mon, Sep 6, 2021 at 3:19 AM Yuan Fu <casouri@gmail.com> wrote: >> Dynamic modules comes with nice things, for example Emacs looks for them automatically in load-path; Emacs reports errors with it has problem loading one; On the other hand, dynamic modules don’t come with much complications. Yes, you need additional emacs-modules.h and tree-sitter-<lang>.c to build it, but that’s about it. > > See my other discussion with Eli. We want to rely on the distro to provide the > binaries and the `tree-sitter` CLI program, and to be able to use shared libs > from other sources as well (like self-built). They are not going to be Emacs > dynamic modules. > >> I can package some additional information with the module; I could maybe distribute them through ordinary package.el facility, etc etc. > > Neither of these requires it to be a module at all. (Also note that package.el > isn't able to handle platform-specific files at the moment.) > >> If I load the shared library directly, I need to reinvent the wheels for loading, error reporting, searching in load-path, and others. > > The non-module-specific part of loading is provided by `dynlib.h`. There's no > wheel to reinvent here. What error reporting do you mean? (You are going to need > additional checks for ABI compatibility anyway.) Searching a load path (not the > `load-path`) is not that complicated. What are the others? > >> And I was hoping to distribute pre-built modules anyway, so if all went well, ordinary users don’t need to compile the modules. WDYT? 
> > It's good to provide that convenience, but it should not be at the expense of > not being able to use binaries from other sources, or to build the binaries on > their own. The `tree-sitter-langs` package already enables both of these. It > provides both pre-built binaries and functions for users to compile on their > own. And it does so without putting language definitions in dynamic modules. > >> P.S. what do you mean by “load path pollution”? > > I meant to say load path collision, but since you use `tree-sitter-{lang}` for > the module name, that's less of a problem. Load path pollution is these names > showing up when the user enumerates entries on the load path trying to go to the > source of a Lisp library. That's annoying, but bearable. > >> P.P.S. My impression is that other applications distribute language definitions by themselves, and it is not common for distros to package language definitions, is that correct? > > I don't understand this. Can you rephrase it? > > All in all, you are severely underestimating the amount of complexity and wheels > you will have to reinvent in other places compared to the amount of code you > don't have to write by requiring language definitions to be in dynamic modules. > (It's less than 100 lines, most of which are docstrings and comments.) I see your point. If no one else objects, I’ll change the code to use shared libraries instead of dynamic modules. Thanks for the input :-) Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-06 0:23 ` Yuan Fu @ 2021-09-06 5:33 ` Eli Zaretskii 2021-09-07 15:38 ` Tuấn-Anh Nguyễn 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-09-06 5:33 UTC (permalink / raw) To: Yuan Fu; +Cc: ubolonton, theo, cpitclaudel, emacs-devel, monnier, stephen_leake > From: Yuan Fu <casouri@gmail.com> > Date: Sun, 5 Sep 2021 17:23:33 -0700 > Cc: Theodor Thornhill <theo@thornhill.no>, > Stephen Leake <stephen_leake@stephe-leake.org>, > Eli Zaretskii <eliz@gnu.org>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > Stefan Monnier <monnier@iro.umontreal.ca>, > emacs-devel <emacs-devel@gnu.org> > > I see your point. If no one else object, I’ll change the code to use shared libraries instead of dynamic modules. Thanks for the input :-) Can we please stop for a moment and describe what exactly is required for loading a language module? I think it would be good to have that documented in this discussion for posterity, and so that we make sure we are all on the same page. I understand that a language module gets compiled into a shared library, either as part of building TS or separately. But what should Emacs do to "load" the module, and when should it do that? And how do we intend to handle the situation where a module is needed, but is not available (i.e. its loading fails)? Emacs has a load-on-demand infrastructure for shared libraries, but it only exists on MS-Windows, where we support installations of Emacs binaries without some of the optional libraries, and want to handle that gracefully. However, this doesn't seem to be a similar situation; for starters, load-on-demand needs to know at Emacs build time the names of entry points (functions and variables) we need to import from each shared library. So I guess we are talking about some (slightly) different mechanism here? Thanks. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-06 5:33 ` Eli Zaretskii @ 2021-09-07 15:38 ` Tuấn-Anh Nguyễn 2021-09-07 16:16 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Tuấn-Anh Nguyễn @ 2021-09-07 15:38 UTC (permalink / raw) To: Eli Zaretskii Cc: Yuan Fu, Theodor Thornhill, Clément Pit-Claudel, emacs-devel, Stefan Monnier, Stephen Leake On Mon, Sep 6, 2021 at 12:33 PM Eli Zaretskii <eliz@gnu.org> wrote: > I understand that a language module gets compiled into a shared > library, either as part of building TS or separately. But what should > Emacs do to "load" the module, and when should it do that? And how do > we intend to handle the situation where a module is needed, but is not > available (i.e. its loading fails)? Emacs should "load" the module when it's asked to do so, by a function, e.g. `tree-sitter-load-lang`. When loading fails, it should signal an error. To locate the module, I think there are 2 possible approaches: 1. Emacs consults a new search path variable to look for the module, which is named `<lang>[.ext]`, and calls `dynlib_open` with the absolute path. 2. Emacs calls `dynlib_open` with the basename `tree-sitter-<lang>[.ext]`, relying on the module being correctly put on the system's library search path, e.g. by the distro's package manager. Option 2 sounds better to me, but option 1 is how people do it at the moment. (And no distro has packaged these AFAICT.) > Emacs has a load-on-demand infrastructure for shared libraries, but it > only exists on MS-Windows, where we support installations of Emacs > binaries without some of the optional libraries, and want to handle > that gracefully. However, this doesn't seem to be a similar > situation; for starters, load-on-demand needs to know at Emacs build > time the names of entry points (functions and variables) we need to > import from each shared library. So I guess we are talking about some > (slightly) different mechanism here? 
For each language, the entry point is a single function `const TSLanguage *tree_sitter_<lang> (void)`, where `lang` is the name declared in the grammar's DSL source. This naming is ensured by the parser generator (the `tree-sitter` CLI program). -- Tuấn-Anh Nguyễn Software Engineer ^ permalink raw reply [flat|nested] 370+ messages in thread
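A minimal sketch of the loading step described above, using POSIX `dlopen`/`dlsym` as stand-ins for the `dynlib.h` functions. The helper name `load_ts_language`, its error-buffer convention, and the assumed entry-point prototype `const TSLanguage *tree_sitter_<lang> (void)` are illustrative assumptions, not code from any Emacs branch.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

/* Opaque stand-in for tree-sitter's TSLanguage; the real declaration
   lives in <tree_sitter/api.h>.  */
typedef struct TSLanguage TSLanguage;

/* Assumed type of a grammar's entry point, e.g. tree_sitter_c.  */
typedef const TSLanguage *(*ts_lang_fn) (void);

/* Hypothetical helper: open LIB_FILE, look up ENTRY_POINT and call it.
   Return the language definition, or NULL on failure with a
   human-readable message stored in ERRBUF.  */
static const TSLanguage *
load_ts_language (const char *lib_file, const char *entry_point,
                  char *errbuf, size_t errbuf_size)
{
  assert (lib_file && entry_point && errbuf && errbuf_size > 0);
  errbuf[0] = '\0';

  void *handle = dlopen (lib_file, RTLD_LAZY | RTLD_LOCAL);
  if (!handle)
    {
      const char *msg = dlerror ();
      snprintf (errbuf, errbuf_size, "cannot open %s: %s",
                lib_file, msg ? msg : "unknown error");
      return NULL;
    }

  /* POSIX idiom for storing a dlsym result in a function pointer
     without a direct object-to-function pointer cast.  */
  ts_lang_fn fn = NULL;
  *(void **) (&fn) = dlsym (handle, entry_point);
  if (!fn)
    {
      snprintf (errbuf, errbuf_size, "%s has no symbol %s",
                lib_file, entry_point);
      dlclose (handle);
      return NULL;
    }

  /* The handle is deliberately left open: the returned pointer refers
     to data that lives inside the loaded library.  */
  return fn ();
}
```

With option 2 above, the first argument would be a bare basename such as `tree-sitter-c.so`, leaving the search to the system loader; with option 1 it would be an absolute path built from the search-path variable. A Lisp-level caller would turn a NULL result into a signaled error.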
* Re: Tree-sitter api 2021-09-07 15:38 ` Tuấn-Anh Nguyễn @ 2021-09-07 16:16 ` Eli Zaretskii 2021-09-08 3:06 ` Yuan Fu 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-09-07 16:16 UTC (permalink / raw) To: Tuấn-Anh Nguyễn Cc: casouri, theo, cpitclaudel, emacs-devel, monnier, stephen_leake > From: Tuấn-Anh Nguyễn <ubolonton@gmail.com> > Date: Tue, 7 Sep 2021 22:38:52 +0700 > Cc: Yuan Fu <casouri@gmail.com>, Theodor Thornhill <theo@thornhill.no>, > Stephen Leake <stephen_leake@stephe-leake.org>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > Stefan Monnier <monnier@iro.umontreal.ca>, emacs-devel <emacs-devel@gnu.org> > > On Mon, Sep 6, 2021 at 12:33 PM Eli Zaretskii <eliz@gnu.org> wrote: > > I understand that a language module gets compiled into a shared > > library, either as part of building TS or separately. But what should > > Emacs do to "load" the module, and when should it do that? And how do > > we intend to handle the situation where a module is needed, but is not > > available (i.e. its loading fails)? > > Emacs should "load" the module when it's asked to do so, by a function, e.g. > `tree-sitter-load-lang`. When loading fails, it should signal an error. So this has to be an explicit load initiated by a Lisp program? How would that program know which module to load for a given language? (I thought TS would load the module it needs whenever support for a language is requested.) > To locate the module, I think there are 2 possible approaches: > 1. Emacs consults a new search path variable to look for the module, which is > named `<lang>[.ext]`, and calls `dynlib_open` with the absolute path. > 2. Emacs calls `dynlib_open` with the basename `tree-sitter-<lang>[.ext]`, > relying on the module being correctly put on the system's library search path, > e.g. by the distro's package manager. > > Option 2 sounds better to me, but option 1 is how people do it at the moment. > (And no distro has packaged these AFAICT.) 
I think 2 is better, since we are relying on others to build and package these modules. Thanks. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-07 16:16 ` Eli Zaretskii @ 2021-09-08 3:06 ` Yuan Fu 2021-09-10 2:06 ` Yuan Fu 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-09-08 3:06 UTC (permalink / raw) To: Eli Zaretskii Cc: Tuấn-Anh Nguyễn, Theodor Thornhill, Clément Pit-Claudel, emacs-devel, Stefan Monnier, Stephen Leake >> >> Emacs should "load" the module when it's asked to do so, by a function, e.g. >> `tree-sitter-load-lang`. When loading fails, it should signal an error. > > So this has to be an explicit load initiated by a Lisp program? How > would that program know which module to load for a given language? (I > thought TS would load the module it needs whenever support for a > language is requested.) TS doesn’t load the module; it expects the user to pass it a pointer to the language definition. How the user gets the language definition is not its business. The user is supposed to combine TS and a language definition to create a workable parser. See: bool ts_parser_set_language(TSParser *self, const TSLanguage *language); TS only wants a pointer to a TSLanguage. All the language modules have regular names, i.e., tree-sitter-<lang>.so, so I think we can just calculate the name from the language name; or we can add a backup: use an alist to map language names to module names to cover possible irregular names. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-08 3:06 ` Yuan Fu @ 2021-09-10 2:06 ` Yuan Fu 2021-09-10 6:32 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-09-10 2:06 UTC (permalink / raw) To: Eli Zaretskii Cc: Tuấn-Anh Nguyễn, Theodor Thornhill, Clément Pit-Claudel, emacs-devel, Stefan Monnier, Stephen Leake > On Sep 7, 2021, at 8:06 PM, Yuan Fu <casouri@gmail.com> wrote: > >>> >>> Emacs should "load" the module when it's asked to do so, by a function, e.g. >>> `tree-sitter-load-lang`. When loading fails, it should signal an error. >> >> So this has to be an explicit load initiated by a Lisp program? How >> would that program know which module to load for a given language? (I >> thought TS would load the module it needs whenever support for a >> language is requested.) > > TS doesn’t load the module, it expects the user to pass it a pointer to the language definition. How does the user get the language definition is not its business. The user is supposed to combine TS and a language definition to create a workable parser. See: > > bool ts_parser_set_language(TSParser *self, const TSLanguage *language); > > TS only wants a pointer to a TSLanguage. > > All the language modules have regular names, i.e., tree-sitter-<lang>.so, so I think we can just calculate the name from the language name; or we can add a backup: use an alist to map language names to module names to cover possible irregular names. If you think it’s fine, Eli, I’ll start working on this. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-10 2:06 ` Yuan Fu @ 2021-09-10 6:32 ` Eli Zaretskii 2021-09-10 19:57 ` Yuan Fu 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-09-10 6:32 UTC (permalink / raw) To: Yuan Fu; +Cc: ubolonton, theo, cpitclaudel, emacs-devel, monnier, stephen_leake > From: Yuan Fu <casouri@gmail.com> > Date: Thu, 9 Sep 2021 19:06:28 -0700 > Cc: Tuấn-Anh Nguyễn <ubolonton@gmail.com>, > Theodor Thornhill <theo@thornhill.no>, > Stephen Leake <stephen_leake@stephe-leake.org>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > Stefan Monnier <monnier@iro.umontreal.ca>, > emacs-devel@gnu.org > > > TS doesn’t load the module, it expects the user to pass it a pointer to the language definition. How does the user get the language definition is not its business. The user is supposed to combine TS and a language definition to create a workable parser. See: > > > > bool ts_parser_set_language(TSParser *self, const TSLanguage *language); > > > > TS only wants a pointer to a TSLanguage. > > > > All the language modules have regular names, i.e., tree-sitter-<lang>.so, so I think we can just calculate the name from the language name; or we can add a backup: use an alist to map language names to module names to cover possible irregular names. > > If you think it’s fine, Eli, I’ll start working on this. Sure. I guess we will have to have a database of module names for each programming language somewhere? ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-10 6:32 ` Eli Zaretskii @ 2021-09-10 19:57 ` Yuan Fu 2021-09-11 3:41 ` Tuấn-Anh Nguyễn 2021-09-11 5:51 ` Eli Zaretskii 0 siblings, 2 replies; 370+ messages in thread From: Yuan Fu @ 2021-09-10 19:57 UTC (permalink / raw) To: Eli Zaretskii Cc: Tuấn-Anh Nguyễn, Theodor Thornhill, Clément Pit-Claudel, emacs-devel, Stefan Monnier, Stephen Leake > > Sure. I guess we will have to have a database of module names for > each programming language somewhere? My plan is to translate lisp names to C names by default, and have an override list for irregular names that can’t be translated correctly. Just realized another problem, how do we make sure the loaded library is GPL-compatible? There certainly won’t be “plugin_is_GPL_compatible” symbol in them… And IIUC Emacs cannot load GPL-incompatible dynamic libraries? Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-10 19:57 ` Yuan Fu @ 2021-09-11 3:41 ` Tuấn-Anh Nguyễn 2021-09-11 4:11 ` Yuan Fu 2021-09-11 5:51 ` Eli Zaretskii 1 sibling, 1 reply; 370+ messages in thread From: Tuấn-Anh Nguyễn @ 2021-09-11 3:41 UTC (permalink / raw) To: Yuan Fu Cc: Clément Pit-Claudel, Theodor Thornhill, emacs-devel, Stefan Monnier, Eli Zaretskii, Stephen Leake > Just realized another problem, how do we make sure the loaded library is GPL-compatible? This question is rather non-technical, so I can't provide any comments. > There certainly won’t be “plugin_is_GPL_compatible” symbol in them… And IIUC Emacs cannot load GPL-incompatible dynamic libraries? That's one of the reasons for using `dynlib.h` APIs directly. The check for that symbol is at the level of `emacs-module.c`. Let's not conceptually conflate a "shared library" and an "Emacs dynamic module". -- Tuấn-Anh Nguyễn Software Engineer ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-11 3:41 ` Tuấn-Anh Nguyễn @ 2021-09-11 4:11 ` Yuan Fu 2021-09-11 7:23 ` Tuấn-Anh Nguyễn 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-09-11 4:11 UTC (permalink / raw) To: Tuấn-Anh Nguyễn Cc: Clément Pit-Claudel, Theodor Thornhill, emacs-devel, Stefan Monnier, Eli Zaretskii, Stephen Leake > On Sep 10, 2021, at 8:41 PM, Tuấn-Anh Nguyễn <ubolonton@gmail.com> wrote: > >> Just realized another problem, how do we make sure the loaded library is GPL-compatible? > > This question is rather non-technical, so I can't provide any comments. > >> There certainly won’t be “plugin_is_GPL_compatible” symbol in them… And IIUC Emacs cannot load GPL-incompatible dynamic libraries? > > That's one of the reasons for using `dynlib.h` APIs directly. The check for > that symbol is at the level of `emacs-module.c`. Let's not conceptually > conflate a "shared library" and an "Emacs dynamic module”. I think you have it backwards. IIUC the reason why every Emacs dynamic module declares “plugin_is_GPL_compatible” is that every shared library that links with Emacs must be GPL compatible, and an Emacs dynamic module is a shared library. But that’s just my understanding, of course. I’m happy to be corrected. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-11 4:11 ` Yuan Fu @ 2021-09-11 7:23 ` Tuấn-Anh Nguyễn 2021-09-11 19:02 ` Yuan Fu 0 siblings, 1 reply; 370+ messages in thread From: Tuấn-Anh Nguyễn @ 2021-09-11 7:23 UTC (permalink / raw) To: Yuan Fu Cc: Clément Pit-Claudel, Theodor Thornhill, emacs-devel, Stefan Monnier, Eli Zaretskii, Stephen Leake > I think you have it backwards. IIUC the reason why every Emacs dynamic module declares “plugin_is_GPL_compatible” is that every shared library that links with Emacs must be GPL compatible, and an Emacs dynamic module is a shared library. But that’s just my understanding, of course. I’m happy to be corrected. That understanding is wrong. To help you understand better: every Emacs dynamic module is a shared library, but the opposite is not true. If you are still confused, read the relevant parts in `emacs-module.c`. On another note, shared libraries in general don't "link" with Emacs. "Linking" has very specific and precise technical meanings in this context. Please read up on that, starting from "dynamic linking vs. dynamic loading." -- Tuấn-Anh Nguyễn Software Engineer ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-11 7:23 ` Tuấn-Anh Nguyễn @ 2021-09-11 19:02 ` Yuan Fu 0 siblings, 0 replies; 370+ messages in thread From: Yuan Fu @ 2021-09-11 19:02 UTC (permalink / raw) To: Tuấn-Anh Nguyễn Cc: Clément Pit-Claudel, Theodor Thornhill, emacs-devel, Stefan Monnier, Eli Zaretskii, Stephen Leake > On Sep 11, 2021, at 12:23 AM, Tuấn-Anh Nguyễn <ubolonton@gmail.com> wrote: > >> I think you have it backwards. IIUC the reason why every Emacs dynamic module declares “plugin_is_GPL_compatible” is that every shared library that links with Emacs must be GPL compatible, and an Emacs dynamic module is a shared library. But that’s just my understanding, of course. I’m happy to be corrected. > > That understanding is wrong. To help you understand better: every Emacs dynamic > module is a shared library, but the opposite is not true. If you are still > confused, read the relevant parts in `emacs-module.c`. On another note, shared > libraries in general don't "link" with Emacs. "Linking" has very specific and > precise technical meanings in this context. Please read up on that, starting > from "dynamic linking vs. dynamic loading.” I see, thanks for the explanation. Anyway, I’m glad there isn’t an issue. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-10 19:57 ` Yuan Fu 2021-09-11 3:41 ` Tuấn-Anh Nguyễn @ 2021-09-11 5:51 ` Eli Zaretskii 2021-09-11 19:00 ` Yuan Fu 1 sibling, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-09-11 5:51 UTC (permalink / raw) To: Yuan Fu; +Cc: ubolonton, theo, cpitclaudel, emacs-devel, monnier, stephen_leake > From: Yuan Fu <casouri@gmail.com> > Date: Fri, 10 Sep 2021 12:57:22 -0700 > Cc: Tuấn-Anh Nguyễn <ubolonton@gmail.com>, > Theodor Thornhill <theo@thornhill.no>, > Stephen Leake <stephen_leake@stephe-leake.org>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > Stefan Monnier <monnier@iro.umontreal.ca>, > emacs-devel@gnu.org > > > Sure. I guess we will have to have a database of module names for > > each programming language somewhere? > > My plan is to translate lisp names to C names by default, and have an override list for irregular names that can’t be translated correctly. What are "Lisp names" in this context? Are you saying that the name of a programming language, derived from the major mode, can be used to produce the name of the shared library programmatically? If so, how? > Just realized another problem, how do we make sure the loaded library is GPL-compatible? There certainly won’t be “plugin_is_GPL_compatible” symbol in them… And IIUC Emacs cannot load GPL-incompatible dynamic libraries? That's only needed for Emacs modules, not for external libraries that provide some extra functionality on the level of primitives. For those, we just make sure their license is compatible with GPL. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-11 5:51 ` Eli Zaretskii @ 2021-09-11 19:00 ` Yuan Fu 2021-09-11 19:14 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-09-11 19:00 UTC (permalink / raw) To: Eli Zaretskii Cc: Tuấn-Anh Nguyễn, theo, cpitclaudel, emacs-devel, monnier, stephen_leake > On Sep 10, 2021, at 10:51 PM, Eli Zaretskii <eliz@gnu.org> wrote: > >> From: Yuan Fu <casouri@gmail.com> >> Date: Fri, 10 Sep 2021 12:57:22 -0700 >> Cc: Tuấn-Anh Nguyễn <ubolonton@gmail.com>, >> Theodor Thornhill <theo@thornhill.no>, >> Stephen Leake <stephen_leake@stephe-leake.org>, >> Clément Pit-Claudel <cpitclaudel@gmail.com>, >> Stefan Monnier <monnier@iro.umontreal.ca>, >> emacs-devel@gnu.org >> >>> Sure. I guess we will have to have a database of module names for >>> each programming language somewhere? >> >> My plan is to translate lisp names to C names by default, and have an override list for irregular names that can’t be translated correctly. > > What are "Lisp names" in this context? Are you saying that the name > of a programming language, derived from the major mode, can be used to > produce the name of the shared library programmatically? If so, how? I don’t think it’s a rule, but language definitions are conventionally named tree-sitter-<lang>. E.g. tree-sitter-c, tree-sitter-json, tree-sitter-c-sharp. And the symbols they expose are named tree_sitter_<lang>, e.g., tree_sitter_c, tree_sitter_json, tree_sitter_c_sharp. Currently we use a symbol tree-sitter-<lang> to represent a language, so we can translate the symbol tree-sitter-<lang> to tree-sitter-<lang>.so/dylib/dll to get the shared library name, and to tree_sitter_<lang> to get the C symbol name. BTW, since dynamic libraries have different extensions on different systems, what I want to do is to try loading the library with .so, then try .dylib, then try .dll; is that a good idea? > >> Just realized another problem, how do we make sure the loaded library is GPL-compatible? 
There certainly won’t be “plugin_is_GPL_compatible” symbol in them… And IIUC Emacs cannot load GPL-incompatible dynamic libraries? > > That's only needed for Emacs modules, not for external libraries that > provide some extra functionality on the level of primitives. For > those, we just make sure their license is compatible with GPL. Thanks, that’s all I need to know. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
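The default translation plus override list proposed in this message can be sketched as follows. The function `ts_entry_point_name`, the `struct ts_name_override` type, and the override entry used in the usage example are hypothetical names chosen for illustration; nothing here is taken from actual Emacs code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical override entry for grammars whose C symbol does not
   follow the tree_sitter_<lang> convention.  */
struct ts_name_override
{
  const char *lang_symbol;   /* e.g. "tree-sitter-foo" */
  const char *c_symbol;      /* e.g. "tree_sitter_foo_lang" */
};

/* Write into OUT the C entry-point name for LANG_SYMBOL.  Overrides
   are consulted first; otherwise every '-' becomes '_', so
   "tree-sitter-c-sharp" yields "tree_sitter_c_sharp".
   Return 0 on success, -1 if OUT is too small.  */
static int
ts_entry_point_name (const char *lang_symbol,
                     const struct ts_name_override *overrides,
                     size_t n_overrides,
                     char *out, size_t out_size)
{
  assert (lang_symbol && out);

  const char *src = lang_symbol;
  int translate = 1;
  for (size_t i = 0; i < n_overrides; i++)
    if (strcmp (overrides[i].lang_symbol, lang_symbol) == 0)
      {
        src = overrides[i].c_symbol;
        translate = 0;   /* Override names are used verbatim.  */
        break;
      }

  size_t len = strlen (src);
  if (len + 1 > out_size)
    return -1;

  /* Copy including the terminating NUL.  */
  for (size_t i = 0; i <= len; i++)
    out[i] = (translate && src[i] == '-') ? '_' : src[i];
  return 0;
}
```

For example, calling it with `"tree-sitter-json"` and no overrides fills the buffer with `tree_sitter_json`. Going from a human-readable language name such as "C#" to the `c-sharp` part is a separate mapping that this sketch does not attempt.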
* Re: Tree-sitter api 2021-09-11 19:00 ` Yuan Fu @ 2021-09-11 19:14 ` Eli Zaretskii 2021-09-11 19:17 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-09-11 19:14 UTC (permalink / raw) To: Yuan Fu; +Cc: ubolonton, theo, cpitclaudel, emacs-devel, monnier, stephen_leake > From: Yuan Fu <casouri@gmail.com> > Date: Sat, 11 Sep 2021 12:00:59 -0700 > Cc: Tuấn-Anh Nguyễn <ubolonton@gmail.com>, > theo@thornhill.no, > stephen_leake@stephe-leake.org, > cpitclaudel@gmail.com, > monnier@iro.umontreal.ca, > emacs-devel@gnu.org > > >> My plan is to translate lisp names to C names by default, and have an override list for irregular names that can’t be translated correctly. > > > > What are "Lisp names" in this context? Are you saying that the name > > of a programming language, derived from the major mode, can be used to > > produce the name of the shared library programmatically? If so, how? > > I don’t think it’s a rule, but language definitions are conventionally named tree-sitter-<lang>. E.g. tree-sitter-c, tree-sitter-json, tree-sitter-c-sharp. And the symbol they expose are tree_sitter_<lang>, e.g., tree_sitter_c, tree_sitter_jon, tree_sitter_c_sharp. Currently we use a symbol tree-sitter-<lang> to represent a language, so we can translate the symbol tree-sitter-<lang> to tree-sitter-<lang>.so/dylib/dll to get the shared library name, and to tree_sitter_<lang> to get the C symbol name. But the <lang> part is still needed to be concocted somehow. E.g., the conversion from "C#" to "c-sharp" isn't trivial. > BTW, since dynamic libraries has different extensions on different systems, what I want to do it to try loading the library with .so, then try .dylib, then try .dll, is that a good idea? We can do better, see load-suffixes. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-11 19:14 ` Eli Zaretskii @ 2021-09-11 19:17 ` Eli Zaretskii 2021-09-11 20:29 ` Yuan Fu 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-09-11 19:17 UTC (permalink / raw) To: casouri; +Cc: ubolonton, theo, cpitclaudel, emacs-devel, monnier, stephen_leake > Date: Sat, 11 Sep 2021 22:14:26 +0300 > From: Eli Zaretskii <eliz@gnu.org> > Cc: ubolonton@gmail.com, theo@thornhill.no, cpitclaudel@gmail.com, > emacs-devel@gnu.org, monnier@iro.umontreal.ca, stephen_leake@stephe-leake.org > > > BTW, since dynamic libraries has different extensions on different systems, what I want to do it to try loading the library with .so, then try .dylib, then try .dll, is that a good idea? > > We can do better, see load-suffixes. And in C, you can use MODULES_SUFFIX directly. Though we will probably need some minor changes there, to have the suffix defined even in a build --without-modules. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-11 19:17 ` Eli Zaretskii @ 2021-09-11 20:29 ` Yuan Fu 2021-09-12 5:39 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-09-11 20:29 UTC (permalink / raw) To: Eli Zaretskii Cc: Tuấn-Anh Nguyễn, Theodor Thornhill, Clément Pit-Claudel, Emacs developers, Stefan Monnier, stephen_leake > On Sep 11, 2021, at 12:14 PM, Eli Zaretskii <eliz@gnu.org> wrote: > > But the <lang> part is still needed to be concocted somehow. E.g., > the conversion from "C#" to "c-sharp" isn't trivial. > The project name of tree-sitter’s C# definition is “tree-sitter-c-sharp”[1]. So if someone wants to use the C# language, they probably know what symbol represents it (we will explain the translation rule in doc-string and the manual). I also want to point out that we don’t come up with the symbols representing each language, the _user_ passes 'tree-sitter-parser-create' a symbol representing a language, and we translate that symbol to dynamic library name and C symbol name. [1]: https://github.com/tree-sitter/tree-sitter-c-sharp > On Sep 11, 2021, at 12:17 PM, Eli Zaretskii <eliz@gnu.org> wrote: > >> Date: Sat, 11 Sep 2021 22:14:26 +0300 >> From: Eli Zaretskii <eliz@gnu.org> >> Cc: ubolonton@gmail.com, theo@thornhill.no, cpitclaudel@gmail.com, >> emacs-devel@gnu.org, monnier@iro.umontreal.ca, stephen_leake@stephe-leake.org >> >>> BTW, since dynamic libraries has different extensions on different systems, what I want to do it to try loading the library with .so, then try .dylib, then try .dll, is that a good idea? >> >> We can do better, see load-suffixes. > > And in C, you can use MODULES_SUFFIX directly. Though we will > probably need some minor changes there, to have the suffix defined > even in a build --without-modules. I’m using tree-sitter-load-suffixes with default value ‘(“.so”, “.dylib”, “.dll”). Should I populate this variable with MODULES_SUFFIX and MODULES_SECONDARY_SUFFIX, or should I just use the two SUFFIX in C? 
I.e., do you see a need for users to customize suffixes? Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-11 20:29 ` Yuan Fu @ 2021-09-12 5:39 ` Eli Zaretskii 2021-09-13 4:15 ` Yuan Fu 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-09-12 5:39 UTC (permalink / raw) To: Yuan Fu; +Cc: ubolonton, theo, cpitclaudel, emacs-devel, monnier, stephen_leake > From: Yuan Fu <casouri@gmail.com> > Date: Sat, 11 Sep 2021 13:29:09 -0700 > Cc: Tuấn-Anh Nguyễn <ubolonton@gmail.com>, > Theodor Thornhill <theo@thornhill.no>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > Emacs developers <emacs-devel@gnu.org>, > Stefan Monnier <monnier@iro.umontreal.ca>, > stephen_leake@stephe-leake.org > > > But the <lang> part is still needed to be concocted somehow. E.g., > > the conversion from "C#" to "c-sharp" isn't trivial. > > The project name of tree-sitter’s C# definition is “tree-sitter-c-sharp”[1]. So if someone wants to use the C# language, they probably know what symbol represents it (we will explain the translation rule in doc-string and the manual). I also want to point out that we don’t come up with the symbols representing each language, the _user_ passes 'tree-sitter-parser-create' a symbol representing a language, and we translate that symbol to dynamic library name and C symbol name. Surely, you don't mean "user" as in "the person who edits a source file"? I presume you mean the Lisp program, not the human user. That Lisp program is the major mode which wants to use TS services, and the only thing that it has in hand is its own symbol, like 'c-mode' or 'python-mode' or 'f90-mode'. It needs a way to pass the corresponding TS module name to TS, and my question is: how would the major mode compute the correct module name? We need either a mode-specific variable with that name, or some global function that could be used by any major mode to obtain the language module name. 
> >>> BTW, since dynamic libraries has different extensions on different systems, what I want to do it to try loading the library with .so, then try .dylib, then try .dll, is that a good idea? > >> > >> We can do better, see load-suffixes. > > > > And in C, you can use MODULES_SUFFIX directly. Though we will > > probably need some minor changes there, to have the suffix defined > > even in a build --without-modules. > > I’m using tree-sitter-load-suffixes with default value ‘(“.so”, “.dylib”, “.dll”). Should I populate this variable with MODULES_SUFFIX and MODULES_SECONDARY_SUFFIX, or should I just use the two SUFFIX in C? I.e., do you see a need for users to customize suffixes? I'd prefer a general variable shared-library-suffix(es), either a single value specific to the target system or an alist with keys being system names (from system-type). Then we could use that in load-suffixes (instead of MODULES_SUFFIX) and everywhere else. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-12 5:39 ` Eli Zaretskii @ 2021-09-13 4:15 ` Yuan Fu 2021-09-13 11:47 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-09-13 4:15 UTC (permalink / raw) To: Eli Zaretskii Cc: Tuấn-Anh Nguyễn, Theodor Thornhill, Clément Pit-Claudel, Emacs developers, Stefan Monnier, stephen_leake > On Sep 11, 2021, at 10:39 PM, Eli Zaretskii <eliz@gnu.org> wrote: > >> From: Yuan Fu <casouri@gmail.com> >> Date: Sat, 11 Sep 2021 13:29:09 -0700 >> Cc: Tuấn-Anh Nguyễn <ubolonton@gmail.com>, >> Theodor Thornhill <theo@thornhill.no>, >> Clément Pit-Claudel <cpitclaudel@gmail.com>, >> Emacs developers <emacs-devel@gnu.org>, >> Stefan Monnier <monnier@iro.umontreal.ca>, >> stephen_leake@stephe-leake.org >> >>> But the <lang> part is still needed to be concocted somehow. E.g., >>> the conversion from "C#" to "c-sharp" isn't trivial. >> >> The project name of tree-sitter’s C# definition is “tree-sitter-c-sharp”[1]. So if someone wants to use the C# language, they probably know what symbol represents it (we will explain the translation rule in doc-string and the manual). I also want to point out that we don’t come up with the symbols representing each language, the _user_ passes 'tree-sitter-parser-create' a symbol representing a language, and we translate that symbol to dynamic library name and C symbol name. > > Surely, you don't mean "user" as in "the person who edits a source > file"? I presume you mean the Lisp program, not the human user. That > Lisp program is the major mode which wants to use TS services, and the > only thing that it has in hand is its own symbol, like 'c-mode' or > 'python-mode' or 'f90-mode'. It needs a way to pass the corresponding > TS module name to TS, and my question is: how would the major mode > compute the correct module name? We need either a mode-specific > variable with that name, or some global function that could be used by > any major mode to obtain the language module name. 
Not the end-user, no. But not really “Lisp Program”, either. I mean the human being writing the major-mode and adapting the major-mode to utilize tree-sitter features. The major mode writer should be able to figure out the correct symbol to use by checking the project name of the language definition, or the package name of the language definition in her package manager, or by some other means. For example, one should be able to figure out that tree-sitter-c is the symbol for the C language definition, and tree-sitter-c-sharp for C#. Then Emacs automatically translates tree-sitter-c to libtree-sitter-c.so, and tree-sitter-c-sharp to libtree-sitter-c-sharp.so; basically adding “lib” and “.so” (or “.dylib” etc.). If that doesn’t give the correct library name for a quirky language, the major-mode writer can add an entry to tree-sitter-library-name-override-list, such as (tree-sitter-quirky-lang “libtree-sitter-qlang” “tree_sitter_qlang”), and Emacs will use that. (Or she can just use tree-sitter-qlang as the symbol, and Emacs’ automatic translation would work just fine.) > >>>>> BTW, since dynamic libraries has different extensions on different systems, what I want to do it to try loading the library with .so, then try .dylib, then try .dll, is that a good idea? >>>> >>>> We can do better, see load-suffixes. >>> >>> And in C, you can use MODULES_SUFFIX directly. Though we will >>> probably need some minor changes there, to have the suffix defined >>> even in a build --without-modules. >> >> I’m using tree-sitter-load-suffixes with default value ‘(“.so”, “.dylib”, “.dll”). Should I populate this variable with MODULES_SUFFIX and MODULES_SECONDARY_SUFFIX, or should I just use the two SUFFIX in C? I.e., do you see a need for users to customize suffixes? > > I'd prefer a general variable shared-library-suffix(es), either a > single value specific to the target system or an alist with keys being > system names (from system-type). 
Then we could use that in > load-suffixes (instead of MODULES_SUFFIX) and everywhere else. To summarize, we have "load-suffixes" (".elc" ".el", with M_SUFFIX & M_SEC_SUFFIX if modules are enabled), "module-file-suffix" (M_SUFFIX if modules are enabled), "load-file-rep-suffixes" ("" ".gz"). All contribute to the possible file names Emacs tries when loading a file (be it an Elisp file or an Emacs module). I will add a "shared-library-suffix" specifically for loading dynamic libraries; its value will be MODULES_SUFFIX regardless of whether modules are enabled. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
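The scheme proposed in this message (add "lib" and a suffix by default, and consult an override list for irregular names) might look roughly like the following sketch. It rests on several assumptions: the struct, the table, the function name ts_resolve_names, and the fixed ".so" suffix are all hypothetical stand-ins; in Emacs the suffix would come from the build configuration and the override list would live in Lisp.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One entry of a hypothetical override table: language symbol,
   library base name (without suffix), and C symbol name.  */
struct ts_override
{
  const char *symbol;
  const char *library;
  const char *c_symbol;
};

/* Illustrative data only, mirroring the example entry in the message.  */
static const struct ts_override ts_overrides[] = {
  { "tree-sitter-quirky-lang", "libtree-sitter-qlang", "tree_sitter_qlang" },
};

/* Assumed value for this sketch; not the real Emacs definition.  */
#define TS_SHARED_LIBRARY_SUFFIX ".so"

/* Fill LIB with the library file name and CSYM with the C symbol name
   for the language SYMBOL.  Overrides win; otherwise apply the default
   rule: "lib" + SYMBOL + suffix, and SYMBOL with '-' replaced by '_'.
   Returns 0 on success, -1 if either buffer is too small.  */
static int
ts_resolve_names (const char *symbol, char *lib, size_t lib_size,
                  char *csym, size_t csym_size)
{
  assert (symbol != NULL && lib != NULL && csym != NULL);
  size_t n_overrides = sizeof ts_overrides / sizeof ts_overrides[0];
  int a, b;

  for (size_t i = 0; i < n_overrides; i++)
    if (strcmp (symbol, ts_overrides[i].symbol) == 0)
      {
        a = snprintf (lib, lib_size, "%s%s", ts_overrides[i].library,
                      TS_SHARED_LIBRARY_SUFFIX);
        b = snprintf (csym, csym_size, "%s", ts_overrides[i].c_symbol);
        return (a < 0 || (size_t) a >= lib_size
                || b < 0 || (size_t) b >= csym_size) ? -1 : 0;
      }

  /* Default translation rule.  */
  a = snprintf (lib, lib_size, "lib%s%s", symbol, TS_SHARED_LIBRARY_SUFFIX);
  b = snprintf (csym, csym_size, "%s", symbol);
  if (a < 0 || (size_t) a >= lib_size || b < 0 || (size_t) b >= csym_size)
    return -1;
  for (char *p = csym; *p != '\0'; p++)
    if (*p == '-')
      *p = '_';
  return 0;
}
```

Checking the override table first keeps the default rule simple, and a language whose project name already follows the convention needs no table entry at all.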
* Re: Tree-sitter api 2021-09-13 4:15 ` Yuan Fu @ 2021-09-13 11:47 ` Eli Zaretskii 2021-09-13 18:01 ` Yuan Fu 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-09-13 11:47 UTC (permalink / raw) To: Yuan Fu; +Cc: ubolonton, theo, cpitclaudel, emacs-devel, monnier, stephen_leake > From: Yuan Fu <casouri@gmail.com> > Date: Sun, 12 Sep 2021 21:15:31 -0700 > Cc: Tuấn-Anh Nguyễn <ubolonton@gmail.com>, > Theodor Thornhill <theo@thornhill.no>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > Emacs developers <emacs-devel@gnu.org>, > Stefan Monnier <monnier@iro.umontreal.ca>, > stephen_leake@stephe-leake.org > > Not the end-user, no. But not really “Lisp Program”, either. I mean the human being writing the major-mode and adapting the major-mode to utilize tree-sitter features. The major mode writer should be able to figure out the correct symbol to use, if she go checks out the project name for the language definition, or the package name of the language definition in her package manager, or by some other means. For example, one should be able to figure out that tree-sitter-c is the symbol for C language definition, and tree-sitter-c-sharp that C#. Then Emacs automatically translate tree-sitter-c to libtree-sitter-c.so, and tree-sitter-c-sharp to libtree-sitter-c-sharp.so; basically adding “lib” and “.so” (or “dylib” etc). If that doesn’t give the correct library name for a quirky language, the major-mode writer can add an entry to tree-sitter-library-name-override-list—(tree-sitter-quirky-lang “libtree-sitter-qlang” “tree_sitter_qlang”)—and Emacs will use that. (Or she can just use tree-sitter-qlang as the symbol, and Emacs’ auto translation would just fine.) It makes little sense to me to request each major mode to figure this out. It should IMO be a service provided by the TS integration into Emacs. 
> To summarize, we have > > "load-suffixes” (".elc" ".el”, with M_SUFFIX & M_SEC_SUFFIX if modules enabled), > "module-file-suffix” (M_SUFFIX if modules enabled), > "load-file-rep-suffixes” ("" ".gz"). > > All contribute to the possible file names Emacs tries when loading a file (be it a Elisp file or an Emacs module). I will add a "shared-library-suffix” specifically for loading dynamic libraries, its value will be MODULES_SUFFIX regardless if module is enabled. Maybe the other way around: define a shared-library-suffix, and make MODULES_SUFFIX use that if Emacs is built with modules. Otherwise, SGTM, thanks. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-13 11:47 ` Eli Zaretskii @ 2021-09-13 18:01 ` Yuan Fu 2021-09-13 18:07 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-09-13 18:01 UTC (permalink / raw) To: Eli Zaretskii Cc: Tuấn-Anh Nguyễn, Theodor Thornhill, Clément Pit-Claudel, Emacs developers, Stefan Monnier, stephen_leake > On Sep 13, 2021, at 4:47 AM, Eli Zaretskii <eliz@gnu.org> wrote: > >> From: Yuan Fu <casouri@gmail.com> >> Date: Sun, 12 Sep 2021 21:15:31 -0700 >> Cc: Tuấn-Anh Nguyễn <ubolonton@gmail.com>, >> Theodor Thornhill <theo@thornhill.no>, >> Clément Pit-Claudel <cpitclaudel@gmail.com>, >> Emacs developers <emacs-devel@gnu.org>, >> Stefan Monnier <monnier@iro.umontreal.ca>, >> stephen_leake@stephe-leake.org >> >> Not the end-user, no. But not really “Lisp Program”, either. I mean the human being writing the major-mode and adapting the major-mode to utilize tree-sitter features. The major mode writer should be able to figure out the correct symbol to use, if she go checks out the project name for the language definition, or the package name of the language definition in her package manager, or by some other means. For example, one should be able to figure out that tree-sitter-c is the symbol for C language definition, and tree-sitter-c-sharp that C#. Then Emacs automatically translate tree-sitter-c to libtree-sitter-c.so, and tree-sitter-c-sharp to libtree-sitter-c-sharp.so; basically adding “lib” and “.so” (or “dylib” etc). If that doesn’t give the correct library name for a quirky language, the major-mode writer can add an entry to tree-sitter-library-name-override-list—(tree-sitter-quirky-lang “libtree-sitter-qlang” “tree_sitter_qlang”)—and Emacs will use that. (Or she can just use tree-sitter-qlang as the symbol, and Emacs’ auto translation would just fine.) > > It makes little sense to me to request each major mode to figure this > out. It should IMO be a service provided by the TS integration into > Emacs. 
This is IMO the easiest and least confusing way for major-mode authors. But before we continue, what is the way you envisioned? I’m not sure what exactly is the service you want Emacs to provide. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-13 18:01 ` Yuan Fu @ 2021-09-13 18:07 ` Eli Zaretskii 2021-09-13 18:29 ` Yuan Fu 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-09-13 18:07 UTC (permalink / raw) To: Yuan Fu; +Cc: ubolonton, theo, cpitclaudel, emacs-devel, monnier, stephen_leake > From: Yuan Fu <casouri@gmail.com> > Date: Mon, 13 Sep 2021 11:01:47 -0700 > Cc: Tuấn-Anh Nguyễn <ubolonton@gmail.com>, > Theodor Thornhill <theo@thornhill.no>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > Emacs developers <emacs-devel@gnu.org>, > Stefan Monnier <monnier@iro.umontreal.ca>, > stephen_leake@stephe-leake.org > > > It makes little sense to me to request each major mode to figure this > > out. It should IMO be a service provided by the TS integration into > > Emacs. > > This is IMO the easiest and least confusing way for major-mode authors. But before we continue, what is the way you envisioned? I’m not sure what exactly is the service you want Emacs to provide. What I had in mind is a function that, given the major-mode symbol, will return the name of the corresponding TS language module (or a list of modules, if there's more than one). ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-13 18:07 ` Eli Zaretskii @ 2021-09-13 18:29 ` Yuan Fu 2021-09-13 18:37 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-09-13 18:29 UTC (permalink / raw) To: Eli Zaretskii Cc: Tuấn-Anh Nguyễn, Theodor Thornhill, Clément Pit-Claudel, Emacs developers, Stefan Monnier, stephen_leake > On Sep 13, 2021, at 11:07 AM, Eli Zaretskii <eliz@gnu.org> wrote: > >> From: Yuan Fu <casouri@gmail.com> >> Date: Mon, 13 Sep 2021 11:01:47 -0700 >> Cc: Tuấn-Anh Nguyễn <ubolonton@gmail.com>, >> Theodor Thornhill <theo@thornhill.no>, >> Clément Pit-Claudel <cpitclaudel@gmail.com>, >> Emacs developers <emacs-devel@gnu.org>, >> Stefan Monnier <monnier@iro.umontreal.ca>, >> stephen_leake@stephe-leake.org >> >>> It makes little sense to me to request each major mode to figure this >>> out. It should IMO be a service provided by the TS integration into >>> Emacs. >> >> This is IMO the easiest and least confusing way for major-mode authors. But before we continue, what is the way you envisioned? I’m not sure what exactly is the service you want Emacs to provide. > > What I had in mind is a function that give the major-mode symbol will > return the name of the corresponding TS language module (or a list of > modules, if there's more than one). My problem with such a function is that Emacs can’t possibly cover all the major-modes and tree-sitter languages. What if there is a new language, and someone wrote a tree-sitter language definition for it, and then want to write an Emacs major mode using tree-sitter features? Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-13 18:29 ` Yuan Fu @ 2021-09-13 18:37 ` Eli Zaretskii 2021-09-14 0:13 ` Yuan Fu 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-09-13 18:37 UTC (permalink / raw) To: Yuan Fu; +Cc: ubolonton, theo, cpitclaudel, emacs-devel, monnier, stephen_leake > From: Yuan Fu <casouri@gmail.com> > Date: Mon, 13 Sep 2021 11:29:01 -0700 > Cc: Tuấn-Anh Nguyễn <ubolonton@gmail.com>, > Theodor Thornhill <theo@thornhill.no>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > Emacs developers <emacs-devel@gnu.org>, > Stefan Monnier <monnier@iro.umontreal.ca>, > stephen_leake@stephe-leake.org > > > What I had in mind is a function that give the major-mode symbol will > > return the name of the corresponding TS language module (or a list of > > modules, if there's more than one). > > My problem with such a function is that Emacs can’t possibly cover all the major-modes and tree-sitter languages. What if there is a new language, and someone wrote a tree-sitter language definition for it, and then want to write an Emacs major mode using tree-sitter features? A new major mode will extend the function to support its language(s). The extension could be as simple as adding something to a database of known mode-to-language associations in some alist. ^ permalink raw reply [flat|nested] 370+ messages in thread
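The "database of known mode-to-language associations" suggested here could be pictured as a simple lookup table. The sketch below is only an illustration: in Emacs this would more naturally be a Lisp alist that major modes can extend at run time, whereas this C table is fixed at compile time. The names ts_mode_entry, ts_mode_table and ts_languages_for_mode are hypothetical, and the table entries are example data rather than an agreed mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TS_MAX_LANGS 4

/* A major mode and the NULL-terminated list of tree-sitter language
   symbols it uses (a mode may need more than one language).  */
struct ts_mode_entry
{
  const char *mode;
  const char *langs[TS_MAX_LANGS + 1];
};

/* Example associations only; assumptions made for this sketch.  */
static const struct ts_mode_entry ts_mode_table[] = {
  { "c-mode", { "tree-sitter-c", NULL } },
  { "python-mode", { "tree-sitter-python", NULL } },
  { "csharp-mode", { "tree-sitter-c-sharp", NULL } },
  { "html-mode", { "tree-sitter-html", "tree-sitter-javascript", NULL } },
};

/* Return the NULL-terminated list of language symbols registered for
   MODE, or NULL if the mode is not in the table.  */
static const char *const *
ts_languages_for_mode (const char *mode)
{
  assert (mode != NULL);
  size_t n = sizeof ts_mode_table / sizeof ts_mode_table[0];
  for (size_t i = 0; i < n; i++)
    if (strcmp (mode, ts_mode_table[i].mode) == 0)
      return ts_mode_table[i].langs;
  return NULL;
}
```

Returning a list rather than a single name covers the "more than one module" case mentioned earlier in the thread, and a NULL result lets a mode fall back to non-tree-sitter behavior.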
* Re: Tree-sitter api 2021-09-13 18:37 ` Eli Zaretskii @ 2021-09-14 0:13 ` Yuan Fu 2021-09-14 2:29 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-09-14 0:13 UTC (permalink / raw) To: Eli Zaretskii Cc: Tuấn-Anh Nguyễn, Theodor Thornhill, Clément Pit-Claudel, Emacs developers, Stefan Monnier, stephen_leake >> >>> What I had in mind is a function that give the major-mode symbol will >>> return the name of the corresponding TS language module (or a list of >>> modules, if there's more than one). >> >> My problem with such a function is that Emacs can’t possibly cover all the major-modes and tree-sitter languages. What if there is a new language, and someone wrote a tree-sitter language definition for it, and then want to write an Emacs major mode using tree-sitter features? > > A new major mode will extend the function to support its language(s). > the extension could be as simple as adding something to a database of > known mode-to-language associations in some alist. > Just to recap, we were talking about how to represent a tree-sitter language in Emacs and how to figure out the dynamic library name for that language. My plan is to use tree-sitter-<lang> to represent a language, which is usually the project name for that language definition. And we just turn it into libtree-sitter-<lang>.so/dylib/dll to get the name of the dynamic library. I think your idea has evolved into another thing—translating major-mode to the tree-sitter languages it uses could be useful, but how does it help with the original topic (representing language, translate to library name)? Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-14 0:13 ` Yuan Fu @ 2021-09-14 2:29 ` Eli Zaretskii 2021-09-14 4:27 ` Yuan Fu 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-09-14 2:29 UTC (permalink / raw) To: Yuan Fu; +Cc: ubolonton, theo, cpitclaudel, emacs-devel, monnier, stephen_leake > From: Yuan Fu <casouri@gmail.com> > Date: Mon, 13 Sep 2021 17:13:40 -0700 > Cc: Tuấn-Anh Nguyễn <ubolonton@gmail.com>, > Theodor Thornhill <theo@thornhill.no>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > Emacs developers <emacs-devel@gnu.org>, > Stefan Monnier <monnier@iro.umontreal.ca>, > stephen_leake@stephe-leake.org > > > A new major mode will extend the function to support its language(s). > > the extension could be as simple as adding something to a database of > > known mode-to-language associations in some alist. > > > > Just to recap, we were talking about how to represent a tree-sitter language in Emacs and how to figure out the dynamic library name for that language. My plan is to use tree-sitter-<lang> to represent a language, which is usually the project name for that language definition. And we just turn it into libtree-sitter-<lang>.so/dylib/dll to get the name of the dynamic library. I think your idea has evolved into another thing—translating major-mode to the tree-sitter languages it uses could be useful, but how does it help with the original topic (representing language, translate to library name)? I guess I don't see a problem there? What is the problem? ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-14 2:29 ` Eli Zaretskii @ 2021-09-14 4:27 ` Yuan Fu 2021-09-14 11:29 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-09-14 4:27 UTC (permalink / raw) To: Eli Zaretskii Cc: Tuấn-Anh Nguyễn, Theodor Thornhill, Clément Pit-Claudel, Emacs developers, Stefan Monnier, stephen_leake >> >> Just to recap, we were talking about how to represent a tree-sitter language in Emacs and how to figure out the dynamic library name for that language. My plan is to use tree-sitter-<lang> to represent a language, which is usually the project name for that language definition. And we just turn it into libtree-sitter-<lang>.so/dylib/dll to get the name of the dynamic library. I think your idea has evolved into another thing—translating major-mode to the tree-sitter languages it uses could be useful, but how does it help with the original topic (representing language, translate to library name)? > > I guess I don't see a problem there? What is the problem? I thought you proposed the major mode thing to replace the naming scheme, because we were talking about naming languages and translating language names to library names when you proposed it. So you agree to the initial plan to translate tree-sitter-<lang> to libtree-sitter-<lang>.so/etc, and to use an override alist for irregular names? Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-14 4:27 ` Yuan Fu @ 2021-09-14 11:29 ` Eli Zaretskii 2021-09-15 0:50 ` Yuan Fu 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-09-14 11:29 UTC (permalink / raw) To: Yuan Fu; +Cc: ubolonton, theo, cpitclaudel, emacs-devel, monnier, stephen_leake > From: Yuan Fu <casouri@gmail.com> > Date: Mon, 13 Sep 2021 21:27:00 -0700 > Cc: Tuấn-Anh Nguyễn <ubolonton@gmail.com>, > Theodor Thornhill <theo@thornhill.no>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > Emacs developers <emacs-devel@gnu.org>, > Stefan Monnier <monnier@iro.umontreal.ca>, > stephen_leake@stephe-leake.org > > > I guess I don't see a problem there? What is the problem? > > I thought you proposed the major mode thing to replace the naming scheme, because we were talking about naming languages and translating language names to library names when you proposed it. So you agree to to the initial plan to translate tree-sitter-<lang> to libtree-sitter-<lang>.so/etc, and to use an override alist for irregular names? Almost: there's the (minor) problem of obtaining the "<lang>" part by the major-mode. I think it would be good to have a utility function to do that so that major modes won't need to reinvent the wheel, do the research, etc. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-14 11:29 ` Eli Zaretskii @ 2021-09-15 0:50 ` Yuan Fu 2021-09-15 6:15 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-09-15 0:50 UTC (permalink / raw) To: Eli Zaretskii Cc: Tuấn-Anh Nguyễn, Theodor Thornhill, Clément Pit-Claudel, Emacs developers, Stefan Monnier, stephen_leake >> >> I thought you proposed the major mode thing to replace the naming scheme, because we were talking about naming languages and translating language names to library names when you proposed it. So you agree to to the initial plan to translate tree-sitter-<lang> to libtree-sitter-<lang>.so/etc, and to use an override alist for irregular names? > > Almost: there's the (minor) problem of obtaining the "<lang>" part by > the major-mode. I think it would be good to have a utility function > to do that so that major modes won't need to reinvent the wheel, do > the research, etc. That’s what I don’t understand: the major mode is written by major mode writers, who certainly know the correct “<lang>” name: they need to read the source of the language definition to use the language’s tree-sitter features. You seem to agree on that because you said that this function can be extended by major mode writers. But if you mean an ordinary end user needs to know the correct <lang> name, then it makes more sense. In that case, it is not Emacs’ but major mode writers’ responsibility to teach that function how to map a major mode to one or many tree-sitter language names. Is that what you meant? Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-15 0:50 ` Yuan Fu @ 2021-09-15 6:15 ` Eli Zaretskii 2021-09-15 15:56 ` Yuan Fu 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-09-15 6:15 UTC (permalink / raw) To: Yuan Fu; +Cc: ubolonton, theo, cpitclaudel, emacs-devel, monnier, stephen_leake > From: Yuan Fu <casouri@gmail.com> > Date: Tue, 14 Sep 2021 17:50:48 -0700 > Cc: Tuấn-Anh Nguyễn <ubolonton@gmail.com>, > Theodor Thornhill <theo@thornhill.no>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > Emacs developers <emacs-devel@gnu.org>, > Stefan Monnier <monnier@iro.umontreal.ca>, > stephen_leake@stephe-leake.org > > > Almost: there's the (minor) problem of obtaining the "<lang>" part by > > the major-mode. I think it would be good to have a utility function > > to do that so that major modes won't need to reinvent the wheel, do > > the research, etc. > > That’s where I don’t understand: the major mode is written by major mode writers, who certainly know the correct “<lang>” name: they need to read the source of the language definition to use language’s tree-sitter features. You seem to agree on that because you said that this function can be extended by major mode writers. I don't understand what you are saying here. Why would major mode programmers need to know the correct <lang> name? The TS facilities we will have in Emacs will be language-agnostic, right? For example, to correctly indent a line of code, the major mode will call some hypothetical tree-sitter-get-indentation function, and that function will work in any major mode, provided that the major mode told TS to load the support for the programming language of the buffer. Right? So when the major mode initializes for working with TS, it should tell TS which language to load, and why would we request the major mode programmer to know the correct <lang> name which corresponds to the major mode's programming language? 
Why would they need to "read the source of the language definition to use language’s tree-sitter features"? The specifics of the TS implementation of, say, indentation calculations won't be exposed on the level of the indentation facilities provided by TS integration in Emacs, right? There's some misunderstanding here, and I cannot for the life of me figure out where is it. > But if you mean an ordinary end user need to know the correct <lang> name No, that's not the issue I'm talking about. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-15 6:15 ` Eli Zaretskii @ 2021-09-15 15:56 ` Yuan Fu 2021-09-15 16:02 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-09-15 15:56 UTC (permalink / raw) To: Eli Zaretskii Cc: Tuấn-Anh Nguyễn, Theodor Thornhill, Clément Pit-Claudel, Emacs developers, Stefan Monnier, stephen_leake > On Sep 14, 2021, at 11:15 PM, Eli Zaretskii <eliz@gnu.org> wrote: > >> From: Yuan Fu <casouri@gmail.com> >> Date: Tue, 14 Sep 2021 17:50:48 -0700 >> Cc: Tuấn-Anh Nguyễn <ubolonton@gmail.com>, >> Theodor Thornhill <theo@thornhill.no>, >> Clément Pit-Claudel <cpitclaudel@gmail.com>, >> Emacs developers <emacs-devel@gnu.org>, >> Stefan Monnier <monnier@iro.umontreal.ca>, >> stephen_leake@stephe-leake.org >> >>> Almost: there's the (minor) problem of obtaining the "<lang>" part by >>> the major-mode. I think it would be good to have a utility function >>> to do that so that major modes won't need to reinvent the wheel, do >>> the research, etc. >> >> That’s where I don’t understand: the major mode is written by major mode writers, who certainly know the correct “<lang>” name: they need to read the source of the language definition to use language’s tree-sitter features. You seem to agree on that because you said that this function can be extended by major mode writers. > > I don't understand what you are saying here. Why would major mode > programmers need to know the correct <lang> name? The TS facilities > we will have in Emacs will be language-agnostic, right? For example, > to correctly indent a line of code, the major mode will call some > hypothetical tree-sitter-get-indentation function, and that function > will work in any major mode, provided that the major mode told TS to > load the support for the programming language of the buffer. Right? Now I see why there is confusion. Tree-sitter only provides a “primitive” feature: the concrete syntax tree, and it is not language-agnostic. 
You don’t get indentation for free, unfortunately. Indenting the program using the information from the syntax tree is our problem. Tree-sitter doesn’t have anything like a tree-sitter-get-indentation function, and there is no mechanical way to provide one; a human needs to read the source of the tree-sitter language definition and figure out how to do it. See below.

> So when the major mode initializes for working with TS, it should tell
> TS which language to load, and why would we request the major mode
> programmer to know the correct <lang> name which corresponds to the
> major mode's programming language? Why would they need to "read the
> source of the language definition to use language’s tree-sitter
> features"? The specifics of the TS implementation of, say,
> indentation calculations won't be exposed on the level of the
> indentation facilities provided by TS integration in Emacs, right?

Tree-sitter has no indentation calculation feature. Major mode writers genuinely need to read the source of the tree-sitter language definition. The source tells us what will be in the syntax tree parsed by tree-sitter, and the node names differ from one language to another. For example, if I want to fontify type identifiers in C with font-lock-type-face, I need to know how types are represented in the syntax tree. I look up the source[1], and find

    _type_specifier: $ => choice(
      $.struct_specifier,
      $.union_specifier,
      $.enum_specifier,
      $.macro_type_specifier,
      $.sized_type_specifier,
      $.primitive_type,
      $._type_identifier
    ),

This roughly translates to

    _type_specifier := <struct_specifier>
                     | <union_specifier>
                     | <enum_specifier>
                     | <macro_type_specifier>
                     | <sized_type_specifier>
                     | <primitive_type>
                     | <_type_identifier>

in BNF.

From this (and some other hints) I know I need to grab all the _type_specifier nodes in the syntax tree, find their corresponding text in the buffer, and apply font-lock-type-face.
And type identifiers in another language will be named differently; tree-sitter doesn’t provide an abstraction for semantic names in the syntax tree. > > There's some misunderstanding here, and I cannot for the life of me > figure out where is it. I was very confused, too, for the past several days, but I think we know the source of it now. [1] The source of tree-sitter-c is at https://github.com/tree-sitter/tree-sitter-c/blob/master/grammar.js Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
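A minimal sketch of the "grab the type nodes and apply a face" step described in the message above, written in Python for brevity rather than against any Emacs or tree-sitter API. The `Node` class and `type_spans` function are hypothetical stand-ins. One assumption is worth flagging: as far as I understand tree-sitter, rules whose names start with an underscore are hidden, so the tree would contain the alternatives of `_type_specifier` rather than a node of that name. The sketch therefore matches on the alternatives.

```python
from dataclasses import dataclass, field

# Alternatives of the `_type_specifier` rule quoted in the message above.
# Assumption: the underscore-prefixed rule is hidden, so these are the
# node types that actually show up in the tree.
C_TYPE_NODE_TYPES = {
    "struct_specifier", "union_specifier", "enum_specifier",
    "macro_type_specifier", "sized_type_specifier",
    "primitive_type", "type_identifier",
}


@dataclass
class Node:
    """Hypothetical stand-in for a syntax-tree node."""
    type: str
    start: int
    end: int
    children: list = field(default_factory=list)


def type_spans(node, wanted=C_TYPE_NODE_TYPES):
    """Yield (start, end, face) for every node whose type is in `wanted`."""
    if node.type in wanted:
        yield (node.start, node.end, "font-lock-type-face")
        return  # do not descend: the whole specifier gets one face
    for child in node.children:
        yield from type_spans(child, wanted)


# Toy tree for the C text "int x;" (0-based offsets, end exclusive).
tree = Node("declaration", 0, 6, [
    Node("primitive_type", 0, 3),
    Node("identifier", 4, 5),
    Node(";", 5, 6),
])
spans = list(type_spans(tree))
```

The set of node types is the only language-specific part, which is the point made in the message: another language needs a different set, and someone has to read its grammar to write it.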
* Re: Tree-sitter api 2021-09-15 15:56 ` Yuan Fu @ 2021-09-15 16:02 ` Eli Zaretskii 2021-09-15 18:19 ` Stefan Monnier 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-09-15 16:02 UTC (permalink / raw) To: Yuan Fu; +Cc: ubolonton, theo, cpitclaudel, emacs-devel, monnier, stephen_leake > From: Yuan Fu <casouri@gmail.com> > Date: Wed, 15 Sep 2021 08:56:18 -0700 > Cc: Tuấn-Anh Nguyễn <ubolonton@gmail.com>, > Theodor Thornhill <theo@thornhill.no>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > Emacs developers <emacs-devel@gnu.org>, > Stefan Monnier <monnier@iro.umontreal.ca>, > stephen_leake@stephe-leake.org > > > I don't understand what you are saying here. Why would major mode > > programmers need to know the correct <lang> name? The TS facilities > > we will have in Emacs will be language-agnostic, right? For example, > > to correctly indent a line of code, the major mode will call some > > hypothetical tree-sitter-get-indentation function, and that function > > will work in any major mode, provided that the major mode told TS to > > load the support for the programming language of the buffer. Right? > > Now I see why there is confusion. Tree-sitter only provide a “primitive” feature: the concert syntax tree, and it is not language-agnostic. You don’t get indentation for free, unfortunately. I wasn't talking about tree-sitter itself, I was talking about the facilities Emacs will provide based on TS. There will be in Emacs a function to calculate indentation using TS, right? And that function will be language-agnostic, like indent-line-function is, right? ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-15 16:02 ` Eli Zaretskii @ 2021-09-15 18:19 ` Stefan Monnier 2021-09-15 18:48 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Stefan Monnier @ 2021-09-15 18:19 UTC (permalink / raw) To: Eli Zaretskii Cc: Yuan Fu, ubolonton, theo, cpitclaudel, emacs-devel, stephen_leake > I wasn't talking about tree-sitter itself, I was talking about the > facilities Emacs will provide based on TS. There will be in Emacs a > function to calculate indentation using TS, right? And that function > will be language-agnostic, like indent-line-function is, right? There is such a function but it doesn't do anything itself. It relies on the major-mode to do the heavy lifting which consists in giving indentation rules for each one of the possible node types that can appear in the AST (and as Yuan explained, those types are different for every language). Stefan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-15 18:19 ` Stefan Monnier @ 2021-09-15 18:48 ` Eli Zaretskii 2021-09-16 21:46 ` Yuan Fu 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-09-15 18:48 UTC (permalink / raw) To: Stefan Monnier Cc: casouri, theo, ubolonton, emacs-devel, cpitclaudel, stephen_leake > From: Stefan Monnier <monnier@iro.umontreal.ca> > Cc: Yuan Fu <casouri@gmail.com>, ubolonton@gmail.com, theo@thornhill.no, > cpitclaudel@gmail.com, emacs-devel@gnu.org, > stephen_leake@stephe-leake.org > Date: Wed, 15 Sep 2021 14:19:12 -0400 > > > I wasn't talking about tree-sitter itself, I was talking about the > > facilities Emacs will provide based on TS. There will be in Emacs a > > function to calculate indentation using TS, right? And that function > > will be language-agnostic, like indent-line-function is, right? > > There is such a function but it doesn't do anything itself. It relies > on the major-mode to do the heavy lifting which consists in giving > indentation rules for each one of the possible node types that can > appear in the AST Sure, and the new one will do that with help of TS. But the principle is the same. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-15 18:48 ` Eli Zaretskii @ 2021-09-16 21:46 ` Yuan Fu 2021-09-17 6:06 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-09-16 21:46 UTC (permalink / raw) To: Eli Zaretskii Cc: ubolonton, theo, cpitclaudel, emacs-devel, Stefan Monnier, stephen_leake > On Sep 15, 2021, at 11:48 AM, Eli Zaretskii <eliz@gnu.org> wrote: > >> From: Stefan Monnier <monnier@iro.umontreal.ca> >> Cc: Yuan Fu <casouri@gmail.com>, ubolonton@gmail.com, theo@thornhill.no, >> cpitclaudel@gmail.com, emacs-devel@gnu.org, >> stephen_leake@stephe-leake.org >> Date: Wed, 15 Sep 2021 14:19:12 -0400 >> >>> I wasn't talking about tree-sitter itself, I was talking about the >>> facilities Emacs will provide based on TS. There will be in Emacs a >>> function to calculate indentation using TS, right? And that function >>> will be language-agnostic, like indent-line-function is, right? >> >> There is such a function but it doesn't do anything itself. It relies >> on the major-mode to do the heavy lifting which consists in giving >> indentation rules for each one of the possible node types that can >> appear in the AST > > Sure, and the new one will do that with help of TS. But the principle > is the same. My point is, major mode writers need to read the source of the tree-sitter language definition to do anything useful with tree-sitter, therefore they must know the correct <lang> name. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-16 21:46 ` Yuan Fu @ 2021-09-17 6:06 ` Eli Zaretskii 2021-09-17 6:56 ` Yuan Fu 2021-09-17 12:23 ` Stefan Monnier 0 siblings, 2 replies; 370+ messages in thread From: Eli Zaretskii @ 2021-09-17 6:06 UTC (permalink / raw) To: Yuan Fu; +Cc: ubolonton, theo, cpitclaudel, emacs-devel, monnier, stephen_leake > From: Yuan Fu <casouri@gmail.com> > Date: Thu, 16 Sep 2021 14:46:08 -0700 > Cc: Stefan Monnier <monnier@iro.umontreal.ca>, > ubolonton@gmail.com, > theo@thornhill.no, > cpitclaudel@gmail.com, > emacs-devel@gnu.org, > stephen_leake@stephe-leake.org > > >>> I wasn't talking about tree-sitter itself, I was talking about the > >>> facilities Emacs will provide based on TS. There will be in Emacs a > >>> function to calculate indentation using TS, right? And that function > >>> will be language-agnostic, like indent-line-function is, right? > >> > >> There is such a function but it doesn't do anything itself. It relies > >> on the major-mode to do the heavy lifting which consists in giving > >> indentation rules for each one of the possible node types that can > >> appear in the AST > > > > Sure, and the new one will do that with help of TS. But the principle > > is the same. > > My point is, major mode writers need to read the source of the tree-sitter language definition to do anything useful with tree-sitter If this is so, then why do we bother documenting the Lisp APIs for TS-related features? If Lisp programmers need to read the TS sources to do anything useful in Emacs, let them read the sources, including the Lisp and C sources you are working on? That was somewhat sarcastic, but my point is that this is NOT how we do this kind of stuff in Emacs. We should have Lisp-level facilities that reflect the TS features, and those Lisp-level facilities should be documented and should be the ONLY thing a Lisp programmer needs to read to adapt his/her major mode to TS. 
We should NOT assume that Lisp programmers read the TS source code, exactly like we don't assume that for other libraries, like GnuTLS, librsvg, or libgccjit. Under that modus operandi, the way to glean the <lang> part from the major mode's language name is something that should be part of the facilities we provide. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-17 6:06 ` Eli Zaretskii @ 2021-09-17 6:56 ` Yuan Fu 2021-09-17 7:38 ` Eli Zaretskii 2021-09-17 12:11 ` Tuấn-Anh Nguyễn 2021-09-17 12:23 ` Stefan Monnier 1 sibling, 2 replies; 370+ messages in thread From: Yuan Fu @ 2021-09-17 6:56 UTC (permalink / raw) To: Eli Zaretskii Cc: Tuấn-Anh Nguyễn, Theodor Thornhill, Clément Pit-Claudel, Emacs developers, Stefan Monnier, stephen_leake >> >> My point is, major mode writers need to read the source of the tree-sitter language definition to do anything useful with tree-sitter > > If this is so, then why do we bother documenting the Lisp APIs for > TS-related features? If Lisp programmers need to read the TS sources > to do anything useful in Emacs, let them read the sources, including > the Lisp and C sources you are working on? > > That was somewhat sarcastic, but my point is that this is NOT how we > do this kind of stuff in Emacs. We should have Lisp-level facilities > that reflect the TS features, and those Lisp-level facilities should > be documented and should be the ONLY thing a Lisp programmer needs to > read to adapt his/her major mode to TS. We should NOT assume that > Lisp programmers read the TS source code, exactly like we don't assume > that for other libraries, like GnuTLS, librsvg, or libgccjit. Under > that modus operandi, the way to glean the <lang> part from the major > mode's language name is something that should be part of the > facilities we provide. Thank you for your patience. I certainly believe in documentation and put considerable effort into it, and if it is possible to document as you described, I would do it. We have documentation for all the tree-sitter features provided by Emacs and a bit more, but I don’t think it is possible to document the language definitions. We can think of language definitions as BNF grammars for each language, how do you document that? Say, for the language definition for Scheme below, how do we document it? 

    <token> --> <identifier> | <boolean> | <number>
              | <character> | <string>
              | ( | ) | #( | ' | ` | , | ,@ | .
    <delimiter> --> <whitespace> | ( | ) | " | ;
    <whitespace> --> <space or newline>
    <comment> --> ; <all subsequent characters up to a line break>
    ...
    <number> --> <num 2>| <num 8> | <num 10>| <num 16>
    …

The language definition source of a tree-sitter language is basically that, with some superfluous JavaScript syntax. Language definitions are not mechanisms but data: you can document a mechanism, but not really data.

And I also want to point out that, as Emacs core developers, we can’t possibly provide a good translation from conventional language names to their tree-sitter names (C# -> c-sharp). Maybe we can do a half-decent job, but 1) that won’t cover all available languages, and 2) if there is a new language, we need to wait for the next release to update our translation. It is better for the major mode writers to provide the information on how to translate names, because, as I said earlier, they already know it.

Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-17 6:56 ` Yuan Fu @ 2021-09-17 7:38 ` Eli Zaretskii 2021-09-17 20:30 ` Yuan Fu 2021-09-18 12:33 ` Stephen Leake 2021-09-17 12:11 ` Tuấn-Anh Nguyễn 1 sibling, 2 replies; 370+ messages in thread From: Eli Zaretskii @ 2021-09-17 7:38 UTC (permalink / raw) To: Yuan Fu; +Cc: ubolonton, theo, cpitclaudel, emacs-devel, monnier, stephen_leake > From: Yuan Fu <casouri@gmail.com> > Date: Thu, 16 Sep 2021 23:56:20 -0700 > Cc: Stefan Monnier <monnier@iro.umontreal.ca>, > Tuấn-Anh Nguyễn <ubolonton@gmail.com>, > Theodor Thornhill <theo@thornhill.no>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > Emacs developers <emacs-devel@gnu.org>, > stephen_leake@stephe-leake.org > > We have documentation for all the tree-sitter features provided by Emacs and a bit more, but I don’t think it is possible to document the language definitions. We can think of language definitions as BNF grammars for each language, how do you document that? Why do we need to document the language definitions? When a Lisp programmer defines font-lock and indentation for a programming language in the current Emacs, do they necessarily need to consult the language grammar? > Say, for the language definition for Scheme below, how do we document it? > > <token> --> <identifier> | <boolean> | <number> > | <character> | <string> > | ( | ) | #( | > ' | ` | , | ,@ | . > <delimiter> --> <whitespace> | ( | ) | " | ; > <whitespace> --> <space or newline> > <comment> --> ; <all subsequent characters up to a > line break> > ... > <number> --> <num 2>| <num 8> > | <num 10>| <num 16> > … This stuff should be known to TS; the Lisp programmer only needs to be aware of the results of lexical and syntactical analysis, in terms of their Lisp expressions (Lisp data structures with appropriate symbols and fields). > And I want to also point out that as Emacs core developers, we can’t possibly provide a good translation from convention language names to their tree-sitter name (C# -> c-sharp). 
Maybe we can do a half-decent job, but 1) that won’t cover all available languages, and 2) if there is a new language, we need to wait for the next release to update our translation. It is better for the major mode writers to provide the information on how to translate names. The database used by the conversion should definitely be extensible. But that doesn't mean it should be empty. Anyway, we've spent enough time on this issue. If you are still unconvinced, feel free to do it your way, and let the chips fall as they may. ^ permalink raw reply [flat|nested] 370+ messages in thread
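One way to read "extensible, but not empty" is a built-in table of defaults that a major mode can override or extend. The sketch below is in Python and is only an illustration under that reading. The names `DEFAULT_LANG_NAMES` and `tree_sitter_lang_name` are hypothetical, and only the `C#` -> `c-sharp` pair comes from this thread; the `C` entry is an assumed default based on the tree-sitter-c repository name.

```python
# Built-in defaults. Only "C#" -> "c-sharp" is taken from the discussion;
# the "C" entry is an assumed example.
DEFAULT_LANG_NAMES = {
    "C#": "c-sharp",
    "C": "c",
}


def tree_sitter_lang_name(language, overrides=None):
    """Return the tree-sitter name for `language`.

    `overrides` lets a major mode add or replace entries without
    waiting for a new release of the default table.
    """
    table = {**DEFAULT_LANG_NAMES, **(overrides or {})}
    try:
        return table[language]
    except KeyError:
        raise LookupError(f"no tree-sitter name known for {language!r}") from None


# A major mode for a hypothetical new language supplies its own entry.
name = tree_sitter_lang_name("MyLang", overrides={"MyLang": "my-lang"})
```

With this shape both positions in the thread are covered: common languages resolve from the defaults, and a mode author who already knows the right name can supply it directly.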
* Re: Tree-sitter api 2021-09-17 7:38 ` Eli Zaretskii @ 2021-09-17 20:30 ` Yuan Fu 2021-09-18 2:22 ` Tuấn-Anh Nguyễn 2021-09-18 12:33 ` Stephen Leake 1 sibling, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-09-17 20:30 UTC (permalink / raw) To: Eli Zaretskii Cc: Tuấn-Anh Nguyễn, Theodor Thornhill, Clément Pit-Claudel, Emacs developers, Stefan Monnier, stephen_leake > > Why do we need to document the language definitions? When a Lisp > programmer defines font-lock and indentation for a programming > language in the current Emacs, do they necessarily need to consult the > language grammar? > […] > This stuff should be known to TS; the Lisp programmer only needs to be > aware of the results of lexical and syntactical analysis, in terms of > their Lisp expressions (Lisp data structures with appropriate symbols > and fields). I demonstrate the reason why one needs to consult the source a few messages back: > Tree-sitter has no indentation calculation feature. Major mode writers genuinely need to read the source of the tree-sitter language definition. The source tells us what will be in the syntax tree parsed by tree-sitter, and the node names differ from one language to another. For example, if I want to fontify type identifiers in C with font-lock-type-face, I need to know how is type represented in the syntax tree. I look up the source[1], and find > > _type_specifier: $ => choice( > $.struct_specifier, > $.union_specifier, > $.enum_specifier, > $.macro_type_specifier, > $.sized_type_specifier, > $.primitive_type, > $._type_identifier > ), > > This roughly translates to > > _type_specifier := <struct_specifier> > | <union_specifier> > | <enum_specifier> > | <macro_type_specifier> > | <sized_type_specifier> > | <primitive_type> > | <_type_identifier> > > in BNF > > From this (and some other hint) I know I need to grab all the _type_specifier nodes in the syntax tree, find their corresponding text in the buffer, and apply font-lock-type-face. 
And type identifiers in another language will be named differently, tree-sitter doesn’t provide an abstraction for semantic names in the syntax tree. >> And I want to also point out that as Emacs core developers, we can’t possibly provide a good translation from convention language names to their tree-sitter name (C# -> c-sharp). Maybe we can do a half-decent job, but 1) that won’t cover all available languages, and 2) if there is a new language, we need to wait for the next release to update our translation. It is better for the major mode writers to provide the information on how to translate names. > > The database used by the conversion should definitely be extensible. > But that doesn't mean it should be empty. > > Anyway, we've spent enough time on this issue. If you are still > unconvinced, feel free to do it your way, and let the chips fall as > they may. I’ll do it the way I see fit. You can always comment in the final review (or something). Thanks. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-17 20:30 ` Yuan Fu @ 2021-09-18 2:22 ` Tuấn-Anh Nguyễn 2021-09-18 6:38 ` Yuan Fu 0 siblings, 1 reply; 370+ messages in thread From: Tuấn-Anh Nguyễn @ 2021-09-18 2:22 UTC (permalink / raw) To: Yuan Fu Cc: Clément Pit-Claudel, Theodor Thornhill, Emacs developers, Stefan Monnier, Eli Zaretskii, Stephen Leake > > Tree-sitter has no indentation calculation feature. Major mode writers genuinely need to read the source of the tree-sitter language definition. The source tells us what will be in the syntax tree parsed by tree-sitter, and the node names differ from one language to another. For example, if I want to fontify type identifiers in C with font-lock-type-face, I need to know how is type represented in the syntax tree. I look up the source[1], and find > > > > _type_specifier: $ => choice( > > $.struct_specifier, > > $.union_specifier, > > $.enum_specifier, > > $.macro_type_specifier, > > $.sized_type_specifier, > > $.primitive_type, > > $._type_identifier > > ), > > > > This roughly translates to > > > > _type_specifier := <struct_specifier> > > | <union_specifier> > > | <enum_specifier> > > | <macro_type_specifier> > > | <sized_type_specifier> > > | <primitive_type> > > | <_type_identifier> > > > > in BNF > > > > From this (and some other hint) I know I need to grab all the _type_specifier nodes in the syntax tree, find their corresponding text in the buffer, and apply font-lock-type-face. And type identifiers in another language will be named differently, tree-sitter doesn’t provide an abstraction for semantic names in the syntax tree. > > > >> And I want to also point out that as Emacs core developers, we can’t possibly provide a good translation from convention language names to their tree-sitter name (C# -> c-sharp). Maybe we can do a half-decent job, but 1) that won’t cover all available languages, and 2) if there is a new language, we need to wait for the next release to update our translation. 
It is better for the major mode writers to provide the information on how to translate names. > > > > The database used by the conversion should definitely be extensible. > > But that doesn't mean it should be empty. > > > > Anyway, we've spent enough time on this issue. If you are still > > unconvinced, feel free to do it your way, and let the chips fall as > > they may. > > I’ll do it the way I see fit. You can always comment in the final review (or something). Thanks. Your arguments were reasonable. Please continue the work. It's quite valuable. There will be a lot more important details to discuss. -- Tuấn-Anh Nguyễn Software Engineer ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-18 2:22 ` Tuấn-Anh Nguyễn @ 2021-09-18 6:38 ` Yuan Fu 0 siblings, 0 replies; 370+ messages in thread From: Yuan Fu @ 2021-09-18 6:38 UTC (permalink / raw) To: Tuấn-Anh Nguyễn Cc: Clément Pit-Claudel, Theodor Thornhill, Emacs developers, Stefan Monnier, Eli Zaretskii, Stephen Leake > > Your arguments were reasonable. Please continue the work. It's quite valuable. > There will be a lot more important details to discuss. Thanks. I’ve pushed a change to the branch on GitHub, and now Emacs loads dynamic libraries directly. The name of the library to load can be controlled by tree-sitter-load-name-list. Please have a look and comment as you like. In the meantime, I’ll be working on tests and documentation. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-17 7:38 ` Eli Zaretskii 2021-09-17 20:30 ` Yuan Fu @ 2021-09-18 12:33 ` Stephen Leake 2021-09-20 16:48 ` Yuan Fu 1 sibling, 1 reply; 370+ messages in thread From: Stephen Leake @ 2021-09-18 12:33 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Yuan Fu, theo, ubolonton, emacs-devel, cpitclaudel, monnier Eli Zaretskii <eliz@gnu.org> writes: >> From: Yuan Fu <casouri@gmail.com> >> >> We have documentation for all the tree-sitter features provided by >> Emacs and a bit more, but I don’t think it is possible to document >> the language definitions. We can think of language definitions as >> BNF grammars for each language, how do you document that? > > Why do we need to document the language definitions? When a Lisp > programmer defines font-lock and indentation for a programming > language in the current Emacs, do they necessarily need to consult the > language grammar? Yes! If you want to indent a statement in a language, you need to know the syntax of that statement; you can't define indent for a "generic if statement". Consider Ada and C:

Ada:

    if <expression> then
       <statement_list>
    else
       <statement_list>
    end if;

C:

    if (<expression>)
      <block>
    else
      <block>

The language details matter. -- -- Stephe ^ permalink raw reply [flat|nested] 370+ messages in thread
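The split described in this subthread, a language-agnostic driver plus per-language rules keyed by node type, can be sketched as follows. This is Python for illustration only. The node type names (`if_statement`, `block`) and the offsets are hypothetical and not taken from the real Ada or C grammars.

```python
# Per-language rules: how much each enclosing node type indents its body.
# All node type names and offsets here are made up for illustration.
INDENT_RULES = {
    "ada": {"if_statement": 3},
    "c": {"block": 4},
}


def indent_for(language, ancestor_types):
    """Language-agnostic driver: sum the offsets of the enclosing nodes.

    `ancestor_types` lists the node types enclosing the line, outermost
    first. Node types without a rule contribute nothing.
    """
    rules = INDENT_RULES[language]  # KeyError if the mode supplied no rules
    return sum(rules.get(node_type, 0) for node_type in ancestor_types)


ada_indent = indent_for("ada", ["if_statement", "statement_list"])
c_indent = indent_for("c", ["if_statement", "block"])
```

The driver never mentions a language, but it returns nothing useful until someone who knows each grammar fills in the table.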
* Re: Tree-sitter api 2021-09-18 12:33 ` Stephen Leake @ 2021-09-20 16:48 ` Yuan Fu 2021-09-20 18:48 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-09-20 16:48 UTC (permalink / raw) To: Stephen Leake Cc: Tuấn-Anh Nguyễn, Theodor Thornhill, Clément Pit-Claudel, emacs-devel, Stefan Monnier, Eli Zaretskii A minor question: is there a better CS term for “punctuation marks”? Names like @code{root}, @code{expression}, @code{number}, @code{operator} are nodes' @dfn{type}. However, not all nodes in a syntax tree have a type. Nodes that don't are @dfn{anonymous nodes}, and nodes with a type are @dfn{named nodes}. Anonymous nodes usually represent punctuation marks (FIXME: better word than ``punctuation marks''?) like quote @samp{"} and bracket @samp{[}, or tokens that have a fixed representation, such as keywords like @code{return}. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-20 16:48 ` Yuan Fu @ 2021-09-20 18:48 ` Eli Zaretskii 2021-09-20 19:09 ` John Yates 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-09-20 18:48 UTC (permalink / raw) To: Yuan Fu; +Cc: ubolonton, theo, cpitclaudel, emacs-devel, monnier, stephen_leake > From: Yuan Fu <casouri@gmail.com> > Date: Mon, 20 Sep 2021 09:48:27 -0700 > Cc: Eli Zaretskii <eliz@gnu.org>, > Stefan Monnier <monnier@iro.umontreal.ca>, > Tuấn-Anh Nguyễn <ubolonton@gmail.com>, > Theodor Thornhill <theo@thornhill.no>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > emacs-devel@gnu.org > > A minor question, is there a better CS term for “punctuation marks”? Punctuation characters, I think. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-20 18:48 ` Eli Zaretskii @ 2021-09-20 19:09 ` John Yates 2021-09-21 22:20 ` Yuan Fu 0 siblings, 1 reply; 370+ messages in thread From: John Yates @ 2021-09-20 19:09 UTC (permalink / raw) To: Eli Zaretskii Cc: Yuan Fu, theo, ubolonton, Emacs developers, cpitclaudel, Stefan Monnier, Stephen Leake Maybe: Anonymous nodes are usually tokens composed of punctuation characters like quote @samp{"} and auto-increment @samp{++}, or distinguished identifiers with fixed spellings used as keywords, like @code{return}. /john ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-20 19:09 ` John Yates @ 2021-09-21 22:20 ` Yuan Fu 2021-09-27 4:42 ` Yuan Fu 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-09-21 22:20 UTC (permalink / raw) To: John Yates Cc: Tuấn-Anh Nguyễn, Theodor Thornhill, Clément Pit-Claudel, Emacs developers, Stefan Monnier, Eli Zaretskii, Stephen Leake > On Sep 20, 2021, at 12:09 PM, John Yates <john@yates-sheets.org> wrote: > > Maybe: > > Anonymous nodes are usually tokens composed of > punctuation characters like quote @samp{"} and > auto-increment @samp{++}, or distinguished identifiers > with fixed spellings used as keywords, like @code{return}. > > /John Thanks, John and Eli. I modified it to Names like @code{root}, @code{expression}, @code{number}, @code{operator} are nodes' @dfn{type}. However, not all nodes in a syntax tree have a type. Nodes that don't are @dfn{anonymous nodes}, and nodes with a type are @dfn{named nodes}. Anonymous nodes are tokens with fixed spellings, including punctuation characters like bracket @samp{]}, and keywords like @code{return}. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
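To make the named/anonymous distinction in the draft text concrete, here is a toy model in Python. It follows the draft's wording, where an anonymous node simply has no type. The `Node` class and `named_children` helper are hypothetical and do not correspond to any Emacs or tree-sitter function.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Toy syntax-tree node; `type` is None for an anonymous node."""
    text: str
    type: Optional[str] = None
    children: list = field(default_factory=list)

    @property
    def named(self):
        return self.type is not None


def named_children(node):
    """Return only the children that have a type."""
    return [child for child in node.children if child.named]


# Toy tree for "return 1;": the keyword and the semicolon are tokens
# with fixed spellings, so they are modelled as anonymous nodes.
stmt = Node("return 1;", "root", [
    Node("return"),
    Node("1", "expression", [Node("1", "number")]),
    Node(";"),
])
named = [child.type for child in named_children(stmt)]
anonymous = [child.text for child in stmt.children if not child.named]
```

Code that only cares about structure would typically walk the named children and skip the rest, which is why the documentation needs a clear name for the distinction.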
* Re: Tree-sitter api 2021-09-21 22:20 ` Yuan Fu @ 2021-09-27 4:42 ` Yuan Fu 2021-09-27 5:37 ` Eli Zaretskii 2021-09-27 19:17 ` Stefan Monnier 0 siblings, 2 replies; 370+ messages in thread From: Yuan Fu @ 2021-09-27 4:42 UTC (permalink / raw) To: Eli Zaretskii Cc: Tuấn-Anh Nguyễn, Theodor Thornhill, Clément Pit-Claudel, Emacs developers, Stefan Monnier, Stephen Leake, John Yates Currently, because font-lock.el uses functions and variables defined in tree-sitter.el, it needs to require tree-sitter.el. Should we require tree-sitter.el by default? Then what do we do when tree-sitter is not available on the system? Should I wrap every reference to tree-sitter in font-lock.el with (when (featurep 'tree-sitter))? Or are there better ways to deal with this? Another approach is to define everything tree-sitter related in tree-sitter.el, and make tree-sitter.el require font-lock.el instead of the other way around. Would that be better? Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-27 4:42 ` Yuan Fu @ 2021-09-27 5:37 ` Eli Zaretskii 2021-09-27 19:17 ` Stefan Monnier 1 sibling, 0 replies; 370+ messages in thread From: Eli Zaretskii @ 2021-09-27 5:37 UTC (permalink / raw) To: Yuan Fu Cc: ubolonton, theo, cpitclaudel, emacs-devel, monnier, stephen_leake, john > From: Yuan Fu <casouri@gmail.com> > Date: Sun, 26 Sep 2021 21:42:36 -0700 > Cc: Tuấn-Anh Nguyễn <ubolonton@gmail.com>, > Theodor Thornhill <theo@thornhill.no>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > Emacs developers <emacs-devel@gnu.org>, > Stefan Monnier <monnier@iro.umontreal.ca>, > Stephen Leake <stephen_leake@stephe-leake.org>, > John Yates <john@yates-sheets.org> > > Currently, because font-lock.el uses functions and variables defined in tree-sitter.el, it needs to require tree-sitter.el. But only conditioned on some variable, right? > Should we require tree-sitter.el by default? No, but you could require it on the same condition that makes font-lock use tree-sitter functions. > Then what do we do when tree-sitter is not available on the system? Should I wrap every reference to tree-sitter in font-lock.el with (when (featurep ’tree-sitter))? Or is there better ways to deal with this? Again, you probably already have a condition that wraps it, no? > Another approach is to define everything tree-sitter related in tree-sitter.el, and make tree-sitter.el require font-lock.el instead of the other way around. Would that be better? If it works, yes. But I'm not sure I understand the details well enough for my answer to be correct. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-27 4:42 ` Yuan Fu 2021-09-27 5:37 ` Eli Zaretskii @ 2021-09-27 19:17 ` Stefan Monnier 2021-09-28 5:33 ` Yuan Fu 1 sibling, 1 reply; 370+ messages in thread From: Stefan Monnier @ 2021-09-27 19:17 UTC (permalink / raw) To: Yuan Fu Cc: Eli Zaretskii, Tuấn-Anh Nguyễn, Theodor Thornhill, Clément Pit-Claudel, Emacs developers, Stephen Leake, John Yates > Currently, because font-lock.el uses functions and variables defined in > tree-sitter.el, Why? I don't see any reason why you'd need to change font-lock.el to add support for tree-sitter fontification. Stefan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-27 19:17 ` Stefan Monnier @ 2021-09-28 5:33 ` Yuan Fu 2021-09-28 7:02 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-09-28 5:33 UTC (permalink / raw) To: Stefan Monnier Cc: Tuấn-Anh Nguyễn, Theodor Thornhill, Clément Pit-Claudel, Emacs developers, Eli Zaretskii, Stephen Leake, John Yates > On Sep 27, 2021, at 12:17 PM, Stefan Monnier <monnier@iro.umontreal.ca> wrote: > >> Currently, because font-lock.el uses functions and variables defined in >> tree-sitter.el, > > Why? I don't see any reason why you'd need to change font-lock.el to > add support for tree-sitter fontification. > [Also in reply to Eli:] Before tree-sitter, font-lock roughly consisted of two passes: the syntactic pass (which uses the syntax table) and the regex pass (which uses regex matching). I added a third pass, the tree-sitter pass, because I want to add tree-sitter fontification on top of the existing mechanisms, not replace them. This way we can still fontify keywords. Simply replacing font-lock with tree-sitter font-lock would cause anything relying on the existing fontification facility to stop working when I turn on tree-sitter. For example, I use a package that fontifies keywords like “TODO” and “FIXME”; it would be a shame if it stopped working as soon as I turned on tree-sitter fontification. So it seemed natural to me to augment font-lock.el instead of putting stuff in tree-sitter.el. Now that I have realized the dependency problem, I think I can move all the font-lock integration code to tree-sitter.el and leave font-lock.el untouched, but still maintain the augmenting nature. E.g., define a tree-sitter-fontify-region-function that first calls font-lock-fontify-region-function, then does tree-sitter fontification. And the user can turn on tree-sitter font-lock with, say, tree-sitter-font-lock-mode. With that said, there is still one thing I am not too sure about. 
What should tree-sitter.el do if libtree-sitter is not on the system and tree-sitter.c is not included in Emacs? Should we simply not include tree-sitter.el? Is there an existing build facility that can do that (exclude tree-sitter.el when libtree-sitter is not found on the system)? Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-28 5:33 ` Yuan Fu @ 2021-09-28 7:02 ` Eli Zaretskii 2021-09-28 16:10 ` Yuan Fu 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-09-28 7:02 UTC (permalink / raw) To: Yuan Fu Cc: ubolonton, theo, cpitclaudel, emacs-devel, monnier, stephen_leake, john > From: Yuan Fu <casouri@gmail.com> > Date: Mon, 27 Sep 2021 22:33:17 -0700 > Cc: Eli Zaretskii <eliz@gnu.org>, > Tuấn-Anh Nguyễn <ubolonton@gmail.com>, > Theodor Thornhill <theo@thornhill.no>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > Emacs developers <emacs-devel@gnu.org>, > Stephen Leake <stephen_leake@stephe-leake.org>, > John Yates <john@yates-sheets.org> > > With that said, I still have one thing not too sure. What should tree-sitter.el do if libtree-sitter is not on the system, and tree-sitter.c is not included in Emacs? Should we simply not include tree-sitter.el? Is there existing build facility that can do that (exclude tree-sitter.el when libtree-sitter is not found on system)? I don't think I understand the problem: why would you need "not to include" tree-sitter.el? We have quite a few *.el files that need support from built-ins which could not be available at run time, and yet we don't hesitate to include those *.el files. How is this case different? I guess some details of what bothers you are missing. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-28 7:02 ` Eli Zaretskii @ 2021-09-28 16:10 ` Yuan Fu 2021-09-28 16:28 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-09-28 16:10 UTC (permalink / raw) To: Eli Zaretskii Cc: Tuấn-Anh Nguyễn, Theodor Thornhill, Clément Pit-Claudel, Emacs developers, Stefan Monnier, Stephen Leake, john > On Sep 28, 2021, at 12:02 AM, Eli Zaretskii <eliz@gnu.org> wrote: > >> From: Yuan Fu <casouri@gmail.com> >> Date: Mon, 27 Sep 2021 22:33:17 -0700 >> Cc: Eli Zaretskii <eliz@gnu.org>, >> Tuấn-Anh Nguyễn <ubolonton@gmail.com>, >> Theodor Thornhill <theo@thornhill.no>, >> Clément Pit-Claudel <cpitclaudel@gmail.com>, >> Emacs developers <emacs-devel@gnu.org>, >> Stephen Leake <stephen_leake@stephe-leake.org>, >> John Yates <john@yates-sheets.org> >> >> With that said, I still have one thing not too sure. What should tree-sitter.el do if libtree-sitter is not on the system, and tree-sitter.c is not included in Emacs? Should we simply not include tree-sitter.el? Is there existing build facility that can do that (exclude tree-sitter.el when libtree-sitter is not found on system)? > > I don't think I understand the problem: why would you need "not to > include" tree-sitter.el? We have quite a few *.el files that need > support from built-ins which could not be available at run time, and > yet we don't hesitate to include those *.el files. How is this case > different? I guess some details of what bothers you are missing. Nothing in particular except the naive assumption that we won’t provide functions that don’t work. I didn’t know that we have quite a few *.el files that could potentially not work before. Do you have some examples? Anyway, I can provide a function tree-sitter-avaliable-p similar to native-compilation, that way a user knows if he can use tree-sitter features. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-28 16:10 ` Yuan Fu @ 2021-09-28 16:28 ` Eli Zaretskii 2021-12-13 6:54 ` Yuan Fu 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-09-28 16:28 UTC (permalink / raw) To: Yuan Fu Cc: ubolonton, theo, cpitclaudel, emacs-devel, monnier, stephen_leake, john > From: Yuan Fu <casouri@gmail.com> > Date: Tue, 28 Sep 2021 09:10:32 -0700 > Cc: Stefan Monnier <monnier@iro.umontreal.ca>, > Tuấn-Anh Nguyễn <ubolonton@gmail.com>, > Theodor Thornhill <theo@thornhill.no>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > Emacs developers <emacs-devel@gnu.org>, > Stephen Leake <stephen_leake@stephe-leake.org>, > john@yates-sheets.org > > > I don't think I understand the problem: why would you need "not to > > include" tree-sitter.el? We have quite a few *.el files that need > > support from built-ins which could not be available at run time, and > > yet we don't hesitate to include those *.el files. How is this case > > different? I guess some details of what bothers you are missing. > > Nothing in particular except the naive assumption that we won’t provide functions that don’t work. I didn’t know that we have quite a few *.el files that could potentially not work before. Do you have some examples? Examples include native-compilation (comp.el), xwidgets (xwidget.el), and threads (thread.el). > Anyway, I can provide a function tree-sitter-avaliable-p similar to native-compilation, that way a user knows if he can use tree-sitter features. Yes, that's generally what optional packages do. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-09-28 16:28 ` Eli Zaretskii @ 2021-12-13 6:54 ` Yuan Fu 2021-12-13 12:56 ` Eli Zaretskii 2021-12-18 13:39 ` Daniel Martín 0 siblings, 2 replies; 370+ messages in thread From: Yuan Fu @ 2021-12-13 6:54 UTC (permalink / raw) To: Eli Zaretskii Cc: ubolonton, theo, cpitclaudel, emacs-devel, Stefan Monnier, stephen_leake, john It’s been a while and no one has provided further comments on the indent and font-lock integration of tree-sitter, so I finished the manuals for indent and font-lock integration. They are under 24.6 Font Lock Mode and 24.7 Automatic Indentation of code. Once the author of tree-sitter allows the malloc implementation to be changed at runtime, tree-sitter integration will be ready. (Though I suspect that won’t come soon. The author is still actively developing tree-sitter but he didn’t reply to my request.) As before, the code is at https://github.com/casouri/emacs on the ts branch. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-12-13 6:54 ` Yuan Fu @ 2021-12-13 12:56 ` Eli Zaretskii 2021-12-14 7:19 ` Yuan Fu 2021-12-18 14:45 ` Philipp 2021-12-18 13:39 ` Daniel Martín 1 sibling, 2 replies; 370+ messages in thread From: Eli Zaretskii @ 2021-12-13 12:56 UTC (permalink / raw) To: Yuan Fu Cc: ubolonton, theo, cpitclaudel, emacs-devel, monnier, stephen_leake, john > From: Yuan Fu <casouri@gmail.com> > Date: Sun, 12 Dec 2021 22:54:59 -0800 > Cc: Stefan Monnier <monnier@iro.umontreal.ca>, > ubolonton@gmail.com, > theo@thornhill.no, > cpitclaudel@gmail.com, > emacs-devel@gnu.org, > stephen_leake@stephe-leake.org, > john@yates-sheets.org > > It’s been a while and no one provided further comments on the indent and font-lock integration of tree-sitter, so I finished the manuals for indent and font-lock integration. They are under 24.6 Font Lock Mode and 24.7 Automatic Indentation of code. Once the author of tree-sitter allow tree-sitter to change malloc implementation at runtime, tree-sitter integration will be ready. (Though I suspect that won’t come soon. The author is still actively developing tree-sitter but he didn’t reply to my request.) Would you please ping the authors and tell them that this single issue prevents us from integrating TS into Emacs? Maybe that would change their priorities. I cannot imagine that the feature we are asking is hard to implement. > As before, the code is at https://github.com/casouri/emacs on ts branch. Thanks. Perhaps people could try testing the branch and providing feedback? ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-12-13 12:56 ` Eli Zaretskii @ 2021-12-14 7:19 ` Yuan Fu 2021-12-17 0:14 ` Yuan Fu 2021-12-18 14:45 ` Philipp 1 sibling, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-12-14 7:19 UTC (permalink / raw) To: Eli Zaretskii Cc: Tuấn-Anh Nguyễn, Theodor Thornhill, Clément Pit-Claudel, Emacs developers, Stefan Monnier, Stephen Leake, john >> >> It’s been a while and no one provided further comments on the indent and font-lock integration of tree-sitter, so I finished the manuals for indent and font-lock integration. They are under 24.6 Font Lock Mode and 24.7 Automatic Indentation of code. Once the author of tree-sitter allow tree-sitter to change malloc implementation at runtime, tree-sitter integration will be ready. (Though I suspect that won’t come soon. The author is still actively developing tree-sitter but he didn’t reply to my request.) > > Would you please ping the authors and tell them that this single issue > prevents us from integrating TS into Emacs? Maybe that would change > their priorities. I cannot imagine that the feature we are asking is > hard to implement. Done. >> As before, the code is at https://github.com/casouri/emacs on ts branch. > > Thanks. Perhaps people could try testing the branch and providing > feedback? Yes. Now that the manual is complete, people are welcome to try it out and see what they like and don’t like. It would be even better if someone wants to implement some major modes with the new tree-sitter features. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-12-14 7:19 ` Yuan Fu @ 2021-12-17 0:14 ` Yuan Fu 2021-12-17 7:15 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-12-17 0:14 UTC (permalink / raw) To: Eli Zaretskii Cc: Tuấn-Anh Nguyễn, Theodor Thornhill, Clément Pit-Claudel, Emacs developers, Stefan Monnier, Stephen Leake, john > On Dec 13, 2021, at 11:19 PM, Yuan Fu <casouri@gmail.com> wrote: > >>> >>> It’s been a while and no one provided further comments on the indent and font-lock integration of tree-sitter, so I finished the manuals for indent and font-lock integration. They are under 24.6 Font Lock Mode and 24.7 Automatic Indentation of code. Once the author of tree-sitter allow tree-sitter to change malloc implementation at runtime, tree-sitter integration will be ready. (Though I suspect that won’t come soon. The author is still actively developing tree-sitter but he didn’t reply to my request.) >> >> Would you please ping the authors and tell them that this single issue >> prevents us from integrating TS into Emacs? Maybe that would change >> their priorities. I cannot imagine that the feature we are asking is >> hard to implement. > > Done. Someone commented on my request, saying: > Had this issue as well, but thought was too niche to open an issue. The standard way to change the allocator at runtime is with the LD_PRELOAD envvar (see mimalloc or any allocator doc). IIUC it is more of a user feature, right? That is, you run LD_PRELOAD=xxx program rather than changing the environment programmatically from within the program? Could Emacs do this if tree-sitter doesn’t want to change? BTW the conversation is at https://github.com/tree-sitter/tree-sitter/issues/1535 The author suggested implementing runtime change of malloc on top of the current macros, but I think he missed the point (we don’t want to maintain our own version of tree-sitter). Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-12-17 0:14 ` Yuan Fu @ 2021-12-17 7:15 ` Eli Zaretskii 0 siblings, 0 replies; 370+ messages in thread From: Eli Zaretskii @ 2021-12-17 7:15 UTC (permalink / raw) To: Yuan Fu Cc: ubolonton, theo, cpitclaudel, emacs-devel, monnier, stephen_leake, john > From: Yuan Fu <casouri@gmail.com> > Date: Thu, 16 Dec 2021 16:14:52 -0800 > Cc: Stefan Monnier <monnier@iro.umontreal.ca>, > Tuấn-Anh Nguyễn <ubolonton@gmail.com>, > Theodor Thornhill <theo@thornhill.no>, > Clément Pit-Claudel <cpitclaudel@gmail.com>, > Emacs developers <emacs-devel@gnu.org>, > Stephen Leake <stephen_leake@stephe-leake.org>, > john@yates-sheets.org > > Someone commented on my request saying > > > Had this issue as well, but thought was too niche to open an issue. The standard way to change the allocator at runtime is with the LD_PRELOAD envvar (see mimalloc or any allocator doc). > > IIUC it is more of a user-feature right? Like you will use LD_PRELOAD=xxx program but not change the environment programmatically in the program? Could Emacs do this should tree-sitter doesn’t want to change? I don't think we want to use LD_PRELOAD for this, for several good reasons. It's non-portable, for starters. > The author suggested to implement runtime change of malloc on top of current macros, but I think he missed the point (we don’t want to maintain our own version of tree-sitter). Yes. I hope we get a better response from the developers of Tree-sitter. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-12-13 12:56 ` Eli Zaretskii 2021-12-14 7:19 ` Yuan Fu @ 2021-12-18 14:45 ` Philipp 2021-12-18 14:57 ` Eli Zaretskii 1 sibling, 1 reply; 370+ messages in thread From: Philipp @ 2021-12-18 14:45 UTC (permalink / raw) To: Eli Zaretskii Cc: Yuan Fu, theo, ubolonton, emacs-devel, cpitclaudel, monnier, stephen_leake, john > Am 13.12.2021 um 13:56 schrieb Eli Zaretskii <eliz@gnu.org>: > >> From: Yuan Fu <casouri@gmail.com> >> Date: Sun, 12 Dec 2021 22:54:59 -0800 >> Cc: Stefan Monnier <monnier@iro.umontreal.ca>, >> ubolonton@gmail.com, >> theo@thornhill.no, >> cpitclaudel@gmail.com, >> emacs-devel@gnu.org, >> stephen_leake@stephe-leake.org, >> john@yates-sheets.org >> >> It’s been a while and no one provided further comments on the indent and font-lock integration of tree-sitter, so I finished the manuals for indent and font-lock integration. They are under 24.6 Font Lock Mode and 24.7 Automatic Indentation of code. Once the author of tree-sitter allow tree-sitter to change malloc implementation at runtime, tree-sitter integration will be ready. (Though I suspect that won’t come soon. The author is still actively developing tree-sitter but he didn’t reply to my request.) > > Would you please ping the authors and tell them that this single issue > prevents us from integrating TS into Emacs? Maybe that would change > their priorities. I cannot imagine that the feature we are asking is > hard to implement. That feature in itself won't be enough. Even with it, TreeSitter will have the same problem as GMP: allocation isn't allowed to fail, and longjmp'ing out of it isn't allowed and generally causes undefined behavior. What's needed is a rewrite of the TreeSitter code so that it handles allocation failure properly and gracefully by returning an error to the caller. ^ permalink raw reply [flat|nested] 370+ messages in thread
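The two requirements discussed above (an allocator that the host can replace at runtime, and library code that reports allocation failure to its caller instead of aborting or being longjmp'ed out of) can be illustrated with a small sketch. This is not tree-sitter's API: every name below (`demo_set_allocator`, `demo_copy_text`, `demo_status`, and so on) is hypothetical and only shows the general shape under those assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical library-side hooks: the host installs its allocator at
   runtime instead of selecting it with compile-time macros.  */
typedef void *(*demo_malloc_fn) (size_t);
typedef void (*demo_free_fn) (void *);

static demo_malloc_fn demo_malloc = malloc;
static demo_free_fn demo_free = free;

/* Install the host's allocator; NULL restores the libc default.  */
void
demo_set_allocator (demo_malloc_fn m, demo_free_fn f)
{
  demo_malloc = m ? m : malloc;
  demo_free = f ? f : free;
}

typedef enum { DEMO_OK, DEMO_ERR_NOMEM } demo_status;

/* Copy TEXT into a freshly allocated buffer stored in *OUT.
   On allocation failure, return DEMO_ERR_NOMEM to the caller rather
   than aborting or depending on the allocator to jump away.  */
demo_status
demo_copy_text (const char *text, char **out)
{
  assert (text != NULL && out != NULL);
  size_t len = strlen (text) + 1;
  char *copy = demo_malloc (len);
  if (copy == NULL)
    {
      *out = NULL;
      return DEMO_ERR_NOMEM;
    }
  memcpy (copy, text, len);
  *out = copy;
  return DEMO_OK;
}

/* Release a buffer obtained from demo_copy_text.  */
void
demo_release_text (char *text)
{
  demo_free (text);
}

/* A host allocator that always fails, used to exercise the error path.  */
void *
demo_failing_malloc (size_t size)
{
  (void) size;
  return NULL;
}
```

With this shape, a host that installs `demo_failing_malloc` sees `DEMO_ERR_NOMEM` from `demo_copy_text` and can decide for itself how to recover, which is the behavior asked for in the message above.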
* Re: Tree-sitter api 2021-12-18 14:45 ` Philipp @ 2021-12-18 14:57 ` Eli Zaretskii 2021-12-19 2:51 ` Yuan Fu 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-12-18 14:57 UTC (permalink / raw) To: Philipp Cc: casouri, theo, ubolonton, emacs-devel, cpitclaudel, monnier, stephen_leake, john > From: Philipp <p.stephani2@gmail.com> > Date: Sat, 18 Dec 2021 15:45:18 +0100 > Cc: Yuan Fu <casouri@gmail.com>, > ubolonton@gmail.com, > theo@thornhill.no, > cpitclaudel@gmail.com, > emacs-devel@gnu.org, > monnier@iro.umontreal.ca, > stephen_leake@stephe-leake.org, > john@yates-sheets.org > > > Would you please ping the authors and tell them that this single issue > > prevents us from integrating TS into Emacs? Maybe that would change > > their priorities. I cannot imagine that the feature we are asking is > > hard to implement. > > That feature in itself won't be enough. Even with it, TreeSitter will have the same problem as GMP: allocation isn't allowed to fail, and longjmp'ing out of it isn't allowed and generally causes undefined behavior. It may not be enough to satisfy purists, but it's enough to allow the user to save the session and shut down Emacs in an orderly fashion, instead of abruptly exiting and losing all the edits. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-12-18 14:57 ` Eli Zaretskii @ 2021-12-19 2:51 ` Yuan Fu 2021-12-19 7:11 ` Eli Zaretskii 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-12-19 2:51 UTC (permalink / raw) To: Eli Zaretskii Cc: ubolonton, theo, cpitclaudel, emacs-devel, Philipp, monnier, stephen_leake, john >> >>> Would you please ping the authors and tell them that this single issue >>> prevents us from integrating TS into Emacs? Maybe that would change >>> their priorities. I cannot imagine that the feature we are asking is >>> hard to implement. >> >> That feature in itself won't be enough. Even with it, TreeSitter will have the same problem as GMP: allocation isn't allowed to fail, and longjmp'ing out of it isn't allowed and generally causes undefined behavior. > > It may not be enough to satisfy purists, but it's enough to allow the > user to save the session and shut down Emacs in an orderly fashion, > instead of abruptly exiting and losing all the edits. Users can set tree-sitter-maximum-size to limit the memory usage of tree-sitter. Buffers larger than that cannot enable tree-sitter. That doesn’t solve the problem directly but should let users avoid allocation failures most of the time. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-12-19 2:51 ` Yuan Fu @ 2021-12-19 7:11 ` Eli Zaretskii 2021-12-19 7:52 ` Yuan Fu 0 siblings, 1 reply; 370+ messages in thread From: Eli Zaretskii @ 2021-12-19 7:11 UTC (permalink / raw) To: Yuan Fu Cc: ubolonton, theo, cpitclaudel, emacs-devel, p.stephani2, monnier, stephen_leake, john > From: Yuan Fu <casouri@gmail.com> > Date: Sat, 18 Dec 2021 18:51:25 -0800 > Cc: Philipp <p.stephani2@gmail.com>, > ubolonton@gmail.com, > theo@thornhill.no, > cpitclaudel@gmail.com, > emacs-devel@gnu.org, > monnier@iro.umontreal.ca, > stephen_leake@stephe-leake.org, > john@yates-sheets.org > > >> That feature in itself won't be enough. Even with it, TreeSitter will have the same problem as GMP: allocation isn't allowed to fail, and longjmp'ing out of it isn't allowed and generally causes undefined behavior. > > > > It may not be enough to satisfy purists, but it's enough to allow the > > user to save the session and shut down Emacs in an orderly fashion, > > instead of abruptly exiting and losing all the edits. > > Uses can set tree-sitter-maximum-size to limit memory usage of tree-sitter. Buffers with size larger than that cannot enable tree-sitter. That doesn’t solve the problem directly but should let users avoid allocation failing most of the time. Btw, we should have a good idea how frequent this out-of-memory problem could be with tree-sitter. Did someone try to scroll through all of xdisp.c, using tree-sitter for C Mode fontifications, and measured the memory footprint that produces? If not, I think it would be a good idea to try. If the OOM problem happens frequently with large source files, it may indeed be the case that we will need to disable tree-sitter up front based on some size criteria. ^ permalink raw reply [flat|nested] 370+ messages in thread
* Re: Tree-sitter api 2021-12-19 7:11 ` Eli Zaretskii @ 2021-12-19 7:52 ` Yuan Fu 2021-12-24 10:04 ` Yoav Marco 0 siblings, 1 reply; 370+ messages in thread From: Yuan Fu @ 2021-12-19 7:52 UTC (permalink / raw) To: Eli Zaretskii Cc: Tuấn-Anh Nguyễn, Theodor Thornhill, Clément Pit-Claudel, Emacs developers, Philipp, Stefan Monnier, Stephen Leake, john > On Dec 18, 2021, at 11:11 PM, Eli Zaretskii <eliz@gnu.org> wrote: > >> From: Yuan Fu <casouri@gmail.com> >> Date: Sat, 18 Dec 2021 18:51:25 -0800 >> Cc: Philipp <p.stephani2@gmail.com>, >> ubolonton@gmail.com, >> theo@thornhill.no, >> cpitclaudel@gmail.com, >> emacs-devel@gnu.org, >> monnier@iro.umontreal.ca, >> stephen_leake@stephe-leake.org, >> john@yates-sheets.org >> >>>> That feature in itself won't be enough. Even with it, TreeSitter will have the same problem as GMP: allocation isn't allowed to fail, and longjmp'ing out of it isn't allowed and generally causes undefined behavior. >>> >>> It may not be enough to satisfy purists, but it's enough to allow the >>> user to save the session and shut down Emacs in an orderly fashion, >>> instead of abruptly exiting and losing all the edits. >> >> Uses can set tree-sitter-maximum-size to limit memory usage of tree-sitter. Buffers with size larger than that cannot enable tree-sitter. That doesn’t solve the problem directly but should let users avoid allocation failing most of the time. > > Btw, we should have a good idea how frequent this out-of-memory > problem could be with tree-sitter. Did someone try to scroll through > all of xdisp.c, using tree-sitter for C Mode fontifications, and > measured the memory footprint that produces? If not, I think it would > be a good idea to try. > > If the OOM problem happens frequently with large source files, it may > indeed be the case that we will need to disable tree-sitter up front > based on some size criteria. From the author’s quote and my experiments, tree-sitter uses about 10–20x memory of the buffer size. 
So xdisp.c is fine. Also you don’t need to scroll through the buffer, tree-sitter parses the whole buffer up-front. Yuan ^ permalink raw reply [flat|nested] 370+ messages in thread
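The arithmetic behind the size limit can be sketched as follows. This is only an illustration of the 10–20x figure quoted above and of the "buffers larger than the maximum cannot enable tree-sitter" rule; the function and type names are hypothetical and are not taken from tree-sitter.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Rough multipliers taken from the 10-20x figure quoted in the thread.  */
enum { DEMO_MEM_FACTOR_LOW = 10, DEMO_MEM_FACTOR_HIGH = 20 };

/* Estimated lower and upper bound, in bytes, of parser memory use.  */
struct demo_estimate
{
  size_t low;
  size_t high;
};

/* Estimate parser memory for a buffer of BUFFER_BYTES bytes.  */
struct demo_estimate
demo_estimate_parser_memory (size_t buffer_bytes)
{
  /* Guard against overflow in the multiplication below.  */
  assert (buffer_bytes <= SIZE_MAX / DEMO_MEM_FACTOR_HIGH);
  struct demo_estimate e;
  e.low = buffer_bytes * DEMO_MEM_FACTOR_LOW;
  e.high = buffer_bytes * DEMO_MEM_FACTOR_HIGH;
  return e;
}

/* Return true if a buffer of BUFFER_BYTES bytes may enable the parser,
   given a user-configured MAXIMUM_SIZE in bytes.  */
bool
demo_buffer_may_use_parser (size_t buffer_bytes, size_t maximum_size)
{
  return buffer_bytes <= maximum_size;
}
```

Under these assumed factors, a 1,000-byte buffer would be estimated at roughly 10,000 to 20,000 bytes of parser memory; the gate itself only compares the buffer size against the configured maximum.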
* Re: Tree-sitter api 2021-12-19 7:52 ` Yuan Fu @ 2021-12-24 10:04 ` Yoav Marco 2021-12-24 10:21 ` Yoav Marco 0 siblings, 1 reply; 370+ messages in thread From: Yoav Marco @ 2021-12-24 10:04 UTC (permalink / raw) To: casouri Cc: cpitclaudel, theo, ubolonton, emacs-devel, p.stephani2, monnier, eliz, stephen_leake, john Hi, Yuan and I had a discussion on GitHub https://github.com/casouri/emacs/issues/5 and he suggested we move here. I'm quoting our comments for convenience. Yoav Marco <yoavm448@gmail.com> writes: > Hi! My question is about the lines: > https://github.com/casouri/emacs/blob/a4f90c5f95476914fb8789c67652af1025644af8/src/tree-sitter.c#L1375-L1380 > > /* TODO: We could cache the query object, so that repeatedly > querying with the same query can reuse the query object. It also > saves us from expanding the sexp query into a string. I don't > know how much time that could save though. */ > TSQuery *ts_query = ts_query_new (lang, source, strlen (source), > &error_offset, &error_type); > > > Regarding error handling mostly. > > In this branch queries are saved as *strings* and compiled in the internals on > each use. In elisp-tree-sitter, you call `tsc-make-query` and use the object > it returns for calls to tsc-query-captures which is the analog for > tree-sitter-query-capture. > > What happens if your query is deformed, or simply has a typo in a node name? > We call `tree-sitter-query-capture` on each keystroke in > `tree-sitter-font-lock-fontify-region`. With the compilation occurring > ahead-of-time it would fail once, but here wouldn't it barrage you with > errors? > > Especially with patterns that aren't set in stone and can be modified like > font-lock keywords, I think compiling the query when the pattern is added is > better than on each execution. > > One nice thing though about compiling queries only when queried is that you > can call `ts_query_delete` straight away. 
With users compiling queries it > would need to be up to garbage collection, I think. Yuan Fu <notifications@github.com> writes: >> What happens if your query is deformed, or simply has a typo in a node name? >> We call tree-sitter-query-capture on each keystroke in >> tree-sitter-font-lock-fontify-region. With the compilation occurring >> ahead-of-time it would fail once, but here wouldn't it barrage you with >> errors? > > Not quite barraging, jit-lock will just silently fail and leave a bunch of > logs in Messages. I don't think error out when calling > tree-sitter-query-capture is a grave problem, since 1) it doesn't barrage as > you worried and 2) I don't expect queries in major modes to ship wrong code: > it's not like a bug that could go undiscovered, if the query has a typo, the > major mode writer will certainly find out when he/she tries to fontify a > buffer. > > I can see some advantages to compile the query ahead of time. 1) It would be > helpful to know there is an error before calling > tree-sitter-font-lock-fontify-region and see an unfontified buffer, not > knowing what went wrong. I can add a function, say, tree-sitter-compile-query > that checks a query (as in query pattern) and passes it on if its correct. 2) > It could potentially saves recompilation of the query. But computing the query > most probably takes negligible time. > > On the other hand, compiling the query has downsides: I don't know what does > tsc-make-query return, I assume an internal object? I try to minimize the > number of new object types I introduce to Emacs, for hygiene. So far I've > managed to add only parser object and node object. If there aren't good > reasons I'm inclined to not add a query object. So far the advantages that I > see aren't very convincing. > > If you want to continue the discussion, I suggest we continue at emacs-devel, > that way others who are more knowledgable than I can join and offer their > opinion. 
^ permalink raw reply [flat|nested] 370+ messages in thread
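The caching idea from the TODO comment quoted above can be sketched in isolation. To keep the example self-contained it does not call tree-sitter: `demo_query_compile` is a stand-in for `ts_query_new`, `demo_query_free` for `ts_query_delete`, and the cache is a hypothetical single-entry one keyed on the query source string only (a real cache would presumably also have to key on the language).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for a compiled query; in tree-sitter.c this would be TSQuery.  */
typedef struct
{
  char *source;
} demo_query;

/* Counts compilations so the effect of the cache can be observed.  */
int demo_compile_count = 0;

/* Stand-in for ts_query_new: "compile" SOURCE, or return NULL on failure.  */
demo_query *
demo_query_compile (const char *source)
{
  demo_query *q = malloc (sizeof *q);
  if (q == NULL)
    return NULL;
  size_t len = strlen (source) + 1;
  q->source = malloc (len);
  if (q->source == NULL)
    {
      free (q);
      return NULL;
    }
  memcpy (q->source, source, len);
  demo_compile_count++;
  return q;
}

/* Stand-in for ts_query_delete.  */
void
demo_query_free (demo_query *q)
{
  if (q != NULL)
    {
      free (q->source);
      free (q);
    }
}

/* Single-entry cache holding the most recently compiled query.  */
static demo_query *demo_cached_query = NULL;

/* Return a compiled query for SOURCE, reusing the cached one when the
   source string is unchanged.  The cache owns the returned object.
   Returns NULL if compilation fails; the old entry is then kept.  */
demo_query *
demo_query_get (const char *source)
{
  assert (source != NULL);
  if (demo_cached_query != NULL
      && strcmp (demo_cached_query->source, source) == 0)
    return demo_cached_query;

  demo_query *fresh = demo_query_compile (source);
  if (fresh == NULL)
    return NULL;
  demo_query_free (demo_cached_query);
  demo_cached_query = fresh;
  return fresh;
}
```

Repeated calls with the same source, as would happen when refontifying on every keystroke, compile once; a changed pattern evicts the old entry, so no Lisp-visible query object and no garbage-collector involvement is needed in this variant.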
* Re: Tree-sitter api 2021-12-24 10:04 ` Yoav Marco @ 2021-12-24 10:21 ` Yoav Marco 2021-12-25 8:31 ` Yuan Fu 0 siblings, 1 reply; 370+ messages in thread From: Yoav Marco @ 2021-12-24 10:21 UTC (permalink / raw) To: casouri Cc: cpitclaudel, theo, ubolonton, emacs-devel, p.stephani2, monnier, eliz, stephen_leake, john > Yuan Fu <notifications@github.com> writes: >> I can see some advantages to compile the query ahead of time. 1) It would be >> helpful to know there is an error before calling >> tree-sitter-font-lock-fontify-region and see an unfontified buffer, not >> knowing what went wrong. I can add a function, say, tree-sitter-compile-query >> that checks a query (as in query pattern) and passes it on if its correct. 2) >> It could potentially saves recompilation of the query. But computing the query >> most probably takes negligible time. I'll try to benchmark it. Would be great if it really is nothing. >> On the other hand, compiling the query has downsides: I don't know what does >> tsc-make-query return, I assume an internal object? I try to minimize the >> number of new object types I introduce to Emacs, for hygiene. So far I've >> managed to add only parser object and node object. If there aren't good >> reasons I'm inclined to not add a query object. So far the advantages that I >> see aren't very convincing. Yeah, it returns a user-pointer. ^ permalink raw reply [flat|nested] 370+ messages in thread