unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed
* How to add pseudo vector types
@ 2021-07-14 17:37 Yuan Fu
  2021-07-14 17:44 ` Eli Zaretskii
  2021-07-14 17:47 ` Stefan Monnier
  0 siblings, 2 replies; 284+ messages in thread
From: Yuan Fu @ 2021-07-14 17:37 UTC (permalink / raw)
  To: emacs-devel

Say I want to expose tree-sitter’s parser to lisp, and I define it as a new pseudo vector.

struct Lisp_TS_Parser
{
  union vectorlike_header header;
  Lisp_Object buffer;
  TSParser *parser;
  TSTree *tree;
  TSInput input;
};

Now if I want to return a Lisp_Object, do I initialize this struct and cast it into a Lisp_Object and return it? Like:

Lisp_TS_parser lisp_parser;
...
return (Lisp_Object)lisp_parser;


And how do I use a USER_PTR? Do I cast it into (struct Lisp_User_Ptr) and use it normally, or is there some helper function that I should use?

Are there examples of using pseudo vectors? Thanks

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-14 17:37 How to add pseudo vector types Yuan Fu
@ 2021-07-14 17:44 ` Eli Zaretskii
  2021-07-14 17:47 ` Stefan Monnier
  1 sibling, 0 replies; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-14 17:44 UTC (permalink / raw)
  To: Yuan Fu; +Cc: emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Wed, 14 Jul 2021 13:37:47 -0400
> 
> Say I want to expose tree-sitter’s parser to lisp, and I define it as a new pseudo vector.
> 
> struct Lisp_TS_Parser
> {
>   union vectorlike_header header;
>   Lisp_Object buffer;
>   TSParser *parser;
>   TSTree *tree;
>   TSInput input;
> };

Inside Emacs, or in a module?  I assume the former.

> Now if I want to return a Lisp_Object, do I initialize this struct and cast it into a Lisp_Object and return it? Like:
> 
> Lisp_TS_parser lisp_parser;
> ...
> return (Lisp_Object)lisp_parser;

No, you need to define a proper Lisp_Object, and then define
functions/macros to make a Lisp_Object that represents the struct, and
vice versa.

> And how do I use a USER_PTR? Do I cast it into (struct Lisp_User_Ptr) and use it normally, or is there some helper function that I should use?

Look in lisp.h, you will find some infrastructure there.

> Are there examples of using pseudo vectors?

Every buffer, window, frame, and overlay is a pseudo vector.  Look how
these are handled in lisp.h and in the rest of the code, and you will
find a lot of examples.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-14 17:37 How to add pseudo vector types Yuan Fu
  2021-07-14 17:44 ` Eli Zaretskii
@ 2021-07-14 17:47 ` Stefan Monnier
  2021-07-14 23:48   ` Yuan Fu
  1 sibling, 1 reply; 284+ messages in thread
From: Stefan Monnier @ 2021-07-14 17:47 UTC (permalink / raw)
  To: Yuan Fu; +Cc: emacs-devel

Yuan Fu [2021-07-14 13:37:47] wrote:
> Say I want to expose tree-sitter’s parser to lisp, and I define it as a new pseudo vector.
>
> struct Lisp_TS_Parser
> {
>   union vectorlike_header header;
>   Lisp_Object buffer;
>   TSParser *parser;
>   TSTree *tree;
>   TSInput input;
> };
>
> Now if I want to return a Lisp_Object, do I initialize this struct and cast
> it into a Lisp_Object and return it? Like:
>
> Lisp_TS_parser lisp_parser;
> ...
> return (Lisp_Object)lisp_parser;
>
>
> And how do I use a USER_PTR? Do I cast it into (struct Lisp_User_Ptr) and
> use it normally, or is there some helper function that I should use?

Most likely you'll want some of your functions to take objects that
should be "tree sitter parsers", but you'll only receive a Lisp_Object
so you'll need to be able to *test* that the object you received is
indeed a "tree sitter parser".

For that reason you'll probably want to add a new entry to `pvec_type`
rather than use a USER_PTR.

> Are there examples of using pseudo vectors? Thanks

Lots of them: actual vectors, processes, threads, mutexes, overlays, you
name it.


        Stefan




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-14 17:47 ` Stefan Monnier
@ 2021-07-14 23:48   ` Yuan Fu
  2021-07-15  0:26     ` Yuan Fu
  0 siblings, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-07-14 23:48 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 974 bytes --]

>> 
>> And how do I use a USER_PTR? Do I cast it into (struct Lisp_User_Ptr) and
>> use it normally, or is there some helper function that I should use?
> 
> Most likely you'll want some of your functions to take objects that
> should be "tree sitter parsers", but you'll only receive a Lisp_Object
> so you'll need to be able to *test* that the object you received is
> indeed a "tree sitter parser".
> 
> For that reason you'll probably want to add a new entry to `pvec_type`
> rather than use a USER_PTR.


Actually, what is the correct way to provide a pointer from a dynamic module to Emacs core? I tried to use USER_PTR, but the dynamic module can only return an emacs_value, and to convert an emacs_value to a Lisp_Object, I need to use value_to_lisp, which is not exposed by emacs-module.c.

I want to provide individual tree-sitter language definitions from dynamic modules so that one don’t need to compile Emacs with language definitions.

Yuan

[-- Attachment #2: Type: text/html, Size: 6845 bytes --]

^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-14 23:48   ` Yuan Fu
@ 2021-07-15  0:26     ` Yuan Fu
  2021-07-15  2:48       ` Yuan Fu
  0 siblings, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-07-15  0:26 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 1159 bytes --]



> On Jul 14, 2021, at 7:48 PM, Yuan Fu <casouri@gmail.com> wrote:
> 
>>> 
>>> And how do I use a USER_PTR? Do I cast it into (struct Lisp_User_Ptr) and
>>> use it normally, or is there some helper function that I should use?
>> 
>> Most likely you'll want some of your functions to take objects that
>> should be "tree sitter parsers", but you'll only receive a Lisp_Object
>> so you'll need to be able to *test* that the object you received is
>> indeed a "tree sitter parser".
>> 
>> For that reason you'll probably want to add a new entry to `pvec_type`
>> rather than use a USER_PTR.
> 
> 
> Actually, what is the correct way to provide a pointer from a dynamic module to Emacs core? I tried to use USER_PTR, but the dynamic module can only return an emacs_value, and to convert an emacs_value to a Lisp_Object, I need to use value_to_lisp, which is not exposed by emacs-module.c.
> 
> I want to provide individual tree-sitter language definitions from dynamic modules so that one don’t need to compile Emacs with language definitions.

I just realized that I can regard emacs_value just as Lisp_Object. Is that right?

Yuan


[-- Attachment #2: Type: text/html, Size: 7416 bytes --]

^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-15  0:26     ` Yuan Fu
@ 2021-07-15  2:48       ` Yuan Fu
  2021-07-15  6:39         ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-07-15  2:48 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 930 bytes --]

I defined two pseudo vectors for tree-sitter's parser and node and packaged a dynamic module for tree-sitter’s json language definition. I also wrapped a few tree-sitter functions just to test if everything works. Please have a look. I’m sure there are some problems because I mainly wrote by copy, paste and modifying from other code I found in Emacs source.

To try out this patch, get tree-sitter from https://github.com/tree-sitter/tree-sitter.git <https://github.com/tree-sitter/tree-sitter.git>, make and make install it. Then unzip json-module.zip to get the source of the json dynamic module. If my Makefile is correct, make'ing it should produce a tree-sitter-json.so. Then if you apply ts.patch, compile emacs, and run this snippet, you should get a string representation of the root node.

(require 'tree-sitter-json)
(tree-sitter-node-string (tree-sitter-parse "[1,2]" (tree-sitter-json)))

Yuan






[-- Attachment #2.1: Type: text/html, Size: 1359 bytes --]

[-- Attachment #2.2: ts.patch --]
[-- Type: application/octet-stream, Size: 15270 bytes --]

From 85baf92975224ea99b7f68d5854342803c61f1d1 Mon Sep 17 00:00:00 2001
From: Yuan Fu <casouri@gmail.com>
Date: Wed, 14 Jul 2021 22:26:42 -0400
Subject: [PATCH] checkpoint

---
 configure.ac      |  27 ++++++++-
 src/Makefile.in   |  11 +++-
 src/alloc.c       |  13 +++++
 src/emacs.c       |   4 ++
 src/lisp.h        |   2 +
 src/print.c       |  17 ++++++
 src/tree_sitter.c | 145 ++++++++++++++++++++++++++++++++++++++++++++++
 src/tree_sitter.h |  87 ++++++++++++++++++++++++++++
 8 files changed, 302 insertions(+), 4 deletions(-)
 create mode 100644 src/tree_sitter.c
 create mode 100644 src/tree_sitter.h

diff --git a/configure.ac b/configure.ac
index 830f33844b..42d2d43455 100644
--- a/configure.ac
+++ b/configure.ac
@@ -454,6 +454,7 @@ AC_DEFUN
 OPTION_DEFAULT_OFF([imagemagick],[compile with ImageMagick image support])
 OPTION_DEFAULT_ON([native-image-api], [don't use native image APIs (GDI+ on Windows)])
 OPTION_DEFAULT_IFAVAILABLE([json], [compile with native JSON support])
+OPTION_DEFAULT_IFAVAILABLE([tree-sitter], [compile with tree-sitter])
 
 OPTION_DEFAULT_ON([xft],[don't use XFT for anti aliased fonts])
 OPTION_DEFAULT_ON([harfbuzz],[don't use HarfBuzz for text shaping])
@@ -2963,6 +2964,23 @@ AC_DEFUN
 AC_SUBST(JSON_CFLAGS)
 AC_SUBST(JSON_OBJ)
 
+HAVE_TREE_SITTER=no
+TREE_SITTER_OBJ=
+
+if test "${with_tree_sitter}" != "no"; then
+  EMACS_CHECK_MODULES([TREE_SITTER], [tree-sitter >= 0.0],
+    [HAVE_TREE_SITTER=yes], [HAVE_TREE_SITTER=no])
+  if test "${HAVE_TREE_SITTER}" = yes; then
+    AC_DEFINE(HAVE_TREE_SITTER, 1, [Define if using tree-sitter.])
+    TREE_SITTER_LIBS=-ltree-sitter
+    TREE_SITTER_OBJ="tree_sitter.o"
+  fi
+fi
+
+AC_SUBST(TREE_SITTER_LIBS)
+AC_SUBST(TREE_SITTER_CFLAGS)
+AC_SUBST(TREE_SITTER_OBJ)
+
 NOTIFY_OBJ=
 NOTIFY_SUMMARY=no
 
@@ -4028,6 +4046,12 @@ AC_DEFUN
   *) MISSING="$MISSING json"
      WITH_IFAVAILABLE="$WITH_IFAVAILABLE --with-json=ifavailable";;
 esac
+case $with_tree_sitter,$HAVE_TREE_SITTER in
+  no,* | ifavailable,* | *,yes) ;;
+  *) MISSING="$MISSING tree-sitter"
+     WITH_IFAVAILABLE="$WITH_IFAVAILABLE --with-tree-sitter=ifavailable";;
+esac
+
 if test "X${MISSING}" != X; then
   # If we have a missing library, and we don't have pkg-config installed,
   # the missing pkg-config may be the reason.  Give the user a hint.
@@ -5833,7 +5857,7 @@ AC_DEFUN
 optsep=
 emacs_config_features=
 for opt in ACL CAIRO DBUS FREETYPE GCONF GIF GLIB GMP GNUTLS GPM GSETTINGS \
- HARFBUZZ IMAGEMAGICK JPEG JSON LCMS2 LIBOTF LIBSELINUX LIBSYSTEMD LIBXML2 \
+ HARFBUZZ IMAGEMAGICK JPEG JSON TREE-SITTER LCMS2 LIBOTF LIBSELINUX LIBSYSTEMD LIBXML2 \
  M17N_FLT MODULES NATIVE_COMP NOTIFY NS OLDXMENU PDUMPER PNG RSVG SECCOMP \
  SOUND THREADS TIFF \
  TOOLKIT_SCROLL_BARS UNEXEC X11 XAW3D XDBE XFT XIM XPM XWIDGETS X_TOOLKIT \
@@ -5902,6 +5926,7 @@ AC_DEFUN
   Does Emacs use -lxft?                                   ${HAVE_XFT}
   Does Emacs use -lsystemd?                               ${HAVE_LIBSYSTEMD}
   Does Emacs use -ljansson?                               ${HAVE_JSON}
+  Does Emacs use -ltree-sitter?                           ${HAVE_TREE_SITTER}
   Does Emacs use the GMP library?                         ${HAVE_GMP}
   Does Emacs directly use zlib?                           ${HAVE_ZLIB}
   Does Emacs have dynamic modules support?                ${HAVE_MODULES}
diff --git a/src/Makefile.in b/src/Makefile.in
index 79cddb35b5..bfdfda566e 100644
--- a/src/Makefile.in
+++ b/src/Makefile.in
@@ -320,6 +320,10 @@ JSON_LIBS =
 JSON_CFLAGS = @JSON_CFLAGS@
 JSON_OBJ = @JSON_OBJ@
 
+TREE_SITTER_LIBS = @TREE_SITTER_LIBS@
+TREE_SITTER_FLAGS = @TREE_SITTER_FLAGS@
+TREE_SITTER_OBJ = @TREE_SITTER_OBJ@
+
 INTERVALS_H = dispextern.h intervals.h composite.h
 
 GETLOADAVG_LIBS = @GETLOADAVG_LIBS@
@@ -372,7 +376,7 @@ EMACS_CFLAGS=
   $(WEBKIT_CFLAGS) $(LCMS2_CFLAGS) \
   $(SETTINGS_CFLAGS) $(FREETYPE_CFLAGS) $(FONTCONFIG_CFLAGS) \
   $(HARFBUZZ_CFLAGS) $(LIBOTF_CFLAGS) $(M17N_FLT_CFLAGS) $(DEPFLAGS) \
-  $(LIBSYSTEMD_CFLAGS) $(JSON_CFLAGS) \
+  $(LIBSYSTEMD_CFLAGS) $(JSON_CFLAGS) $(TREE_SITTER_CFLAGS) \
   $(LIBGNUTLS_CFLAGS) $(NOTIFY_CFLAGS) $(CAIRO_CFLAGS) \
   $(WERROR_CFLAGS)
 ALL_CFLAGS = $(EMACS_CFLAGS) $(WARN_CFLAGS) $(CFLAGS)
@@ -406,7 +410,8 @@ base_obj =
 	thread.o systhread.o \
 	$(if $(HYBRID_MALLOC),sheap.o) \
 	$(MSDOS_OBJ) $(MSDOS_X_OBJ) $(NS_OBJ) $(CYGWIN_OBJ) $(FONT_OBJ) \
-	$(W32_OBJ) $(WINDOW_SYSTEM_OBJ) $(XGSELOBJ) $(JSON_OBJ)
+	$(W32_OBJ) $(WINDOW_SYSTEM_OBJ) $(XGSELOBJ) $(JSON_OBJ) \
+	$(TREE_SITTER_OBJ)
 obj = $(base_obj) $(NS_OBJC_OBJ)
 
 ## Object files used on some machine or other.
@@ -516,7 +521,7 @@ LIBES =
    $(FREETYPE_LIBS) $(FONTCONFIG_LIBS) $(HARFBUZZ_LIBS) $(LIBOTF_LIBS) $(M17N_FLT_LIBS) \
    $(LIBGNUTLS_LIBS) $(LIB_PTHREAD) $(GETADDRINFO_A_LIBS) $(LCMS2_LIBS) \
    $(NOTIFY_LIBS) $(LIB_MATH) $(LIBZ) $(LIBMODULES) $(LIBSYSTEMD_LIBS) \
-   $(JSON_LIBS) $(LIBGMP) $(LIBGCCJIT)
+   $(JSON_LIBS) $(LIBGMP) $(LIBGCCJIT) $(TREE_SITTER_LIBS)
 
 ## FORCE it so that admin/unidata can decide whether this file is
 ## up-to-date.  Although since charprop depends on bootstrap-emacs,
diff --git a/src/alloc.c b/src/alloc.c
index 76d8c7ddd1..f144e053f2 100644
--- a/src/alloc.c
+++ b/src/alloc.c
@@ -50,6 +50,10 @@ Copyright (C) 1985-1986, 1988, 1993-1995, 1997-2021 Free Software
 #include TERM_HEADER
 #endif /* HAVE_WINDOW_SYSTEM */
 
+#ifdef HAVE_TREE_SITTER
+#include "tree_sitter.h"
+#endif
+
 #include <flexmember.h>
 #include <verify.h>
 #include <execinfo.h>           /* For backtrace.  */
@@ -3144,6 +3148,15 @@ cleanup_vector (struct Lisp_Vector *vector)
       if (uptr->finalizer)
 	uptr->finalizer (uptr->p);
     }
+#ifdef HAVE_TREE_SITTER
+  else if (PSEUDOVECTOR_TYPEP (&vector->header, PVEC_TS_PARSER))
+    {
+      struct Lisp_TS_Parser *lisp_parser
+	= PSEUDOVEC_STRUCT (vector, Lisp_TS_Parser);
+      ts_tree_delete(lisp_parser->tree);
+      ts_parser_delete(lisp_parser->parser);
+    }
+#endif
 #ifdef HAVE_MODULES
   else if (PSEUDOVECTOR_TYPEP (&vector->header, PVEC_MODULE_FUNCTION))
     {
diff --git a/src/emacs.c b/src/emacs.c
index 60a57a693c..ede390231d 100644
--- a/src/emacs.c
+++ b/src/emacs.c
@@ -85,6 +85,7 @@ #define MAIN_PROGRAM
 #include "intervals.h"
 #include "character.h"
 #include "buffer.h"
+#include "tree_sitter.h"
 #include "window.h"
 #include "xwidget.h"
 #include "atimer.h"
@@ -2057,6 +2058,9 @@ main (int argc, char **argv)
       syms_of_floatfns ();
 
       syms_of_buffer ();
+      #ifdef HAVE_TREE_SITTER
+      syms_of_tree_sitter ();
+      #endif
       syms_of_bytecode ();
       syms_of_callint ();
       syms_of_casefiddle ();
diff --git a/src/lisp.h b/src/lisp.h
index 4fb8923678..e439447283 100644
--- a/src/lisp.h
+++ b/src/lisp.h
@@ -1070,6 +1070,8 @@ DEFINE_GDB_SYMBOL_END (PSEUDOVECTOR_FLAG)
   PVEC_CONDVAR,
   PVEC_MODULE_FUNCTION,
   PVEC_NATIVE_COMP_UNIT,
+  PVEC_TS_PARSER,
+  PVEC_TS_NODE,
 
   /* These should be last, for internal_equal and sxhash_obj.  */
   PVEC_COMPILED,
diff --git a/src/print.c b/src/print.c
index d4301fd7b6..e20a1d065a 100644
--- a/src/print.c
+++ b/src/print.c
@@ -48,6 +48,10 @@ Copyright (C) 1985-1986, 1988, 1993-1995, 1997-2021 Free Software
 # include <sys/socket.h> /* for F_DUPFD_CLOEXEC */
 #endif
 
+#ifdef HAVE_TREE_SITTER
+#include "tree_sitter.h"
+#endif
+
 struct terminal;
 
 /* Avoid actual stack overflow in print.  */
@@ -1853,6 +1857,19 @@ print_vectorlike (Lisp_Object obj, Lisp_Object printcharfun, bool escapeflag,
       }
       break;
 #endif
+
+#ifdef HAVE_TREE_SITTER
+    case PVEC_TS_PARSER:
+      print_c_string ("#<tree-sitter-parser in ", printcharfun);
+      print_string (BVAR (XTS_PARSER (obj)->buffer, name), printcharfun);
+      printchar ('>', printcharfun);
+      break;
+    case PVEC_TS_NODE:
+      print_c_string ("#<tree-sitter-node", printcharfun);
+      printchar ('>', printcharfun);
+      break;
+#endif
+
     default:
       emacs_abort ();
     }
diff --git a/src/tree_sitter.c b/src/tree_sitter.c
new file mode 100644
index 0000000000..f2134c571a
--- /dev/null
+++ b/src/tree_sitter.c
@@ -0,0 +1,145 @@
+/* Tree-sitter integration for GNU Emacs.
+
+Copyright (C) 2021 Free Software Foundation, Inc.
+
+This file is part of GNU Emacs.
+
+GNU Emacs is free software: you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation, either version 3 of the License, or (at
+your option) any later version.
+
+GNU Emacs is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with GNU Emacs.  If not, see <https://www.gnu.org/licenses/>.  */
+
+#include <config.h>
+
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/param.h>
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include "buffer.h"
+#include "coding.h"
+#include "tree_sitter.h"
+
+/* parser.h defines a macro ADVANCE that conflicts with alloc.c.   */
+#include <tree_sitter/parser.h>
+
+Lisp_Object
+make_ts_parser (struct buffer *buffer, TSParser *parser, TSTree *tree)
+{
+  struct Lisp_TS_Parser *lisp_parser
+    = ALLOCATE_PLAIN_PSEUDOVECTOR (struct Lisp_TS_Parser, PVEC_TS_PARSER);
+  lisp_parser->buffer = buffer;
+  lisp_parser->parser = parser;
+  lisp_parser->tree = tree;
+  // TODO TSInput.
+  return make_lisp_ptr (lisp_parser, Lisp_Vectorlike);
+}
+
+Lisp_Object
+make_ts_node (Lisp_Object parser, TSNode node)
+{
+  struct Lisp_TS_Node *lisp_node
+    = ALLOCATE_PSEUDOVECTOR (struct Lisp_TS_Node, parser, PVEC_TS_NODE);
+  lisp_node->parser = parser;
+  lisp_node->node = node;
+  return make_lisp_ptr (lisp_node, Lisp_Vectorlike);
+}
+
+
+/* Tree-sitter parser.  */
+
+DEFUN ("tree-sitter-parse", Ftree_sitter_parse, Stree_sitter_parse,
+       2, 2, 0,
+       doc: /* Parse STRING and return a parser object.
+LANGUAGE should be the language provided by a tree-sitter language
+dynamic module.  */)
+  (Lisp_Object string, Lisp_Object language)
+{
+  CHECK_STRING (string);
+
+  /* LANGUAGE is a USER_PTR that contains the pointer to a
+     TSLanguage struct.  */
+  TSParser *parser = ts_parser_new ();
+  TSLanguage *lang = (XUSER_PTR (language)->p);
+  ts_parser_set_language (parser, lang);
+
+  TSTree *tree = ts_parser_parse_string (parser, NULL,
+					 SSDATA (string),
+					 strlen (SSDATA (string)));
+
+  /* See comment for ts_parser_parse in tree_sitter/api.h
+     for possible reasons for a failure.  */
+  if (tree == NULL)
+    signal_error ("Failed to parse STRING", string);
+
+  TSNode root_node = ts_tree_root_node (tree);
+
+  Lisp_Object lisp_parser = make_ts_parser (NULL, parser, tree);
+  Lisp_Object lisp_node = make_ts_node (lisp_parser, root_node);
+
+  return lisp_node;
+}
+
+DEFUN ("tree-sitter-node-string",
+       Ftree_sitter_node_string, Stree_sitter_node_string, 1, 1, 0,
+       doc: /* Return the string representation of NODE.  */)
+  (Lisp_Object node)
+{
+  TSNode ts_node = XTS_NODE (node)->node;
+  char *string = ts_node_string(ts_node);
+  return make_string(string, strlen (string));
+}
+
+DEFUN ("tree-sitter-node-parent",
+       Ftree_sitter_node_parent, Stree_sitter_node_parent, 1, 1, 0,
+       doc: /* Return the immediate parent of NODE.
+Return nil if couldn't find any.  */)
+  (Lisp_Object node)
+{
+  TSNode ts_node = XTS_NODE (node)->node;
+  TSNode parent = ts_node_parent(ts_node);
+
+  if (ts_node_is_null(parent))
+    return Qnil;
+
+  return make_ts_node(XTS_NODE (node)->parser, parent);
+}
+
+DEFUN ("tree-sitter-node-child",
+       Ftree_sitter_node_child, Stree_sitter_node_child, 2, 2, 0,
+       doc: /* Return the Nth child of NODE.
+Return nil if couldn't find any.  */)
+  (Lisp_Object node, Lisp_Object n)
+{
+  CHECK_INTEGER (n);
+  EMACS_INT idx = XFIXNUM (n);
+  TSNode ts_node = XTS_NODE (node)->node;
+  // FIXME: Is this cast ok?
+  TSNode child = ts_node_child(ts_node, (uint32_t) idx);
+
+  if (ts_node_is_null(child))
+    return Qnil;
+
+  return make_ts_node(XTS_NODE (node)->parser, child);
+}
+
+/* Initialize the tree-sitter routines.  */
+void
+syms_of_tree_sitter (void)
+{
+  defsubr (&Stree_sitter_parse);
+  defsubr (&Stree_sitter_node_string);
+  defsubr (&Stree_sitter_node_parent);
+  defsubr (&Stree_sitter_node_child);
+}
diff --git a/src/tree_sitter.h b/src/tree_sitter.h
new file mode 100644
index 0000000000..3c9e03475f
--- /dev/null
+++ b/src/tree_sitter.h
@@ -0,0 +1,87 @@
+/* Header file for the tree-sitter integration.
+
+Copyright (C) 2021 Free Software Foundation, Inc.
+
+This file is part of GNU Emacs.
+
+GNU Emacs is free software: you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation, either version 3 of the License, or (at
+your option) any later version.
+
+GNU Emacs is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with GNU Emacs.  If not, see <https://www.gnu.org/licenses/>.  */
+
+#ifndef EMACS_TREE_SITTER_H
+#define EMACS_TREE_SITTER_H
+
+#include <sys/types.h>
+
+#include "lisp.h"
+
+#include <tree_sitter/api.h>
+
+INLINE_HEADER_BEGIN
+
+struct Lisp_TS_Parser
+{
+  union vectorlike_header header;
+  struct buffer *buffer;
+  TSParser *parser;
+  TSTree *tree;
+  TSInput input;
+};
+
+struct Lisp_TS_Node
+{
+  union vectorlike_header header;
+  /* This should prevent the gc from collecting the parser before the
+     node is done with it.  TSNode contains a pointer to the tree it
+     belongs to, and the parser object, when collected by gc, will
+     free that tree. */
+  Lisp_Object parser;
+  TSNode node;
+};
+
+INLINE bool
+TS_PARSERP (Lisp_Object x)
+{
+  return PSEUDOVECTORP (x, PVEC_TS_PARSER);
+}
+
+INLINE struct Lisp_TS_Parser *
+XTS_PARSER (Lisp_Object a)
+{
+  eassert (TS_PARSERP (a));
+  return XUNTAG (a, Lisp_Vectorlike, struct Lisp_TS_Parser);
+}
+
+INLINE bool
+TS_NODEP (Lisp_Object x)
+{
+  return PSEUDOVECTORP (x, PVEC_TS_NODE);
+}
+
+INLINE struct Lisp_TS_Node *
+XTS_NODE (Lisp_Object a)
+{
+  eassert (TS_NODEP (a));
+  return XUNTAG (a, Lisp_Vectorlike, struct Lisp_TS_Node);
+}
+
+Lisp_Object
+make_ts_parser (struct buffer *buffer, TSParser *parser, TSTree *tree);
+
+Lisp_Object
+make_ts_node (Lisp_Object parser, TSNode node);
+
+extern void syms_of_tree_sitter (void);
+
+INLINE_HEADER_END
+
+#endif /* EMACS_TREE_SITTER_H */
-- 
2.24.3 (Apple Git-128)


[-- Attachment #2.3: Type: text/html, Size: 133 bytes --]

[-- Attachment #2.4: json-module.zip --]
[-- Type: application/zip, Size: 8797 bytes --]

[-- Attachment #2.5: Type: text/html, Size: 184 bytes --]

^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-15  2:48       ` Yuan Fu
@ 2021-07-15  6:39         ` Eli Zaretskii
  2021-07-15 13:37           ` Fu Yuan
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-15  6:39 UTC (permalink / raw)
  To: Yuan Fu; +Cc: monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Wed, 14 Jul 2021 22:48:30 -0400
> Cc: emacs-devel <emacs-devel@gnu.org>
> 
> I defined two pseudo vectors for tree-sitter's parser and node and packaged a dynamic module for tree-sitter’s json language definition. I also wrapped a few tree-sitter functions just to test if everything works. Please have a look. I’m sure there are some problems because I mainly wrote by copy, paste and modifying from other code I found in Emacs source.

Thanks, but why does it parse only strings, not buffer text?



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-15  6:39         ` Eli Zaretskii
@ 2021-07-15 13:37           ` Fu Yuan
  2021-07-15 14:18             ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Fu Yuan @ 2021-07-15 13:37 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: monnier, emacs-devel



> 在 2021年7月15日,上午2:39,Eli Zaretskii <eliz@gnu.org> 写道:
> 
> 
>> 
>> From: Yuan Fu <casouri@gmail.com>
>> Date: Wed, 14 Jul 2021 22:48:30 -0400
>> Cc: emacs-devel <emacs-devel@gnu.org>
>> 
>> I defined two pseudo vectors for tree-sitter's parser and node and packaged a dynamic module for tree-sitter’s json language definition. I also wrapped a few tree-sitter functions just to test if everything works. Please have a look. I’m sure there are some problems because I mainly wrote by copy, paste and modifying from other code I found in Emacs source.
> 
> Thanks, but why does it parse only strings, not buffer text?

I haven’t written it yet. I want to make sure the pseudo vector definition and configure files are right before going further. IIRC the contribution guide recommends sending small patches and update along the way.

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-15 13:37           ` Fu Yuan
@ 2021-07-15 14:18             ` Eli Zaretskii
  2021-07-15 15:17               ` Yuan Fu
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-15 14:18 UTC (permalink / raw)
  To: Fu Yuan; +Cc: monnier, emacs-devel

> From: Fu Yuan <casouri@gmail.com>
> Date: Thu, 15 Jul 2021 09:37:27 -0400
> Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org
> 
> > Thanks, but why does it parse only strings, not buffer text?
> 
> I haven’t written it yet. I want to make sure the pseudo vector definition and configure files are right before going further. IIRC the contribution guide recommends sending small patches and update along the way.

Great, then please try also to liberate the implementation from using
JSON, it's a major slowdown factor.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-15 14:18             ` Eli Zaretskii
@ 2021-07-15 15:17               ` Yuan Fu
  2021-07-15 15:50                 ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-07-15 15:17 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Stefan Monnier, emacs-devel



> On Jul 15, 2021, at 10:18 AM, Eli Zaretskii <eliz@gnu.org> wrote:
> 
>> From: Fu Yuan <casouri@gmail.com>
>> Date: Thu, 15 Jul 2021 09:37:27 -0400
>> Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org
>> 
>>> Thanks, but why does it parse only strings, not buffer text?
>> 
>> I haven’t written it yet. I want to make sure the pseudo vector definition and configure files are right before going further. IIRC the contribution guide recommends sending small patches and update along the way.
> 
> Great, then please try also to liberate the implementation from using
> JSON, it's a major slowdown factor.

JSON? I didn’t write anything involving JSON. 

While you are looking at the patch, here are some questions for integrating tree-sitter with out buffer implementation. What I envisioned is for each buffer to have a `parser-list’, and on buffer change, we update each parser’s tree. I think modifying signal_after_change is enough to cover al the cases? And, for tree-sitter to take the buffer’s content directly, we need to tell it to skip the gap. I only need to modify gap_left, gap_right, make_gap_smaller and make_gap_larger, right?

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-15 15:17               ` Yuan Fu
@ 2021-07-15 15:50                 ` Eli Zaretskii
  2021-07-15 16:19                   ` Yuan Fu
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-15 15:50 UTC (permalink / raw)
  To: Yuan Fu; +Cc: monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Thu, 15 Jul 2021 11:17:02 -0400
> Cc: Stefan Monnier <monnier@iro.umontreal.ca>,
>  emacs-devel@gnu.org
> 
> > Great, then please try also to liberate the implementation from using
> > JSON, it's a major slowdown factor.
> 
> JSON? I didn’t write anything involving JSON. 

Then what is json-module.zip about?

> While you are looking at the patch, here are some questions for integrating tree-sitter with out buffer implementation. What I envisioned is for each buffer to have a `parser-list’, and on buffer change, we update each parser’s tree. I think modifying signal_after_change is enough to cover al the cases?

Why do you need to do this when a buffer is updated? why not use
display as the trigger?  Large portions of a buffer will never be
displayed, and some buffers will not be displayed at all.  Why waste
cycles on them?  Redisplay is perfectly equipped to tell you when some
chunk of buffer text is going to be redrawn, and it already knows to
do nothing if the buffer haven't changed.

> And, for tree-sitter to take the buffer’s content directly, we need to tell it to skip the gap.

AFAIR, tree-sitter allows the calling package to provide a function to
access the text, isn't that so?  If so, you could write a function
that accesses buffer text via BYTE_POS_ADDR etc., and that knows how
to skip the gap already.

> I only need to modify gap_left, gap_right, make_gap_smaller and make_gap_larger, right?

Why would you need to _modify_ any of these?



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-15 15:50                 ` Eli Zaretskii
@ 2021-07-15 16:19                   ` Yuan Fu
  2021-07-15 16:26                     ` Yuan Fu
  2021-07-15 16:48                     ` Eli Zaretskii
  0 siblings, 2 replies; 284+ messages in thread
From: Yuan Fu @ 2021-07-15 16:19 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: monnier, emacs-devel



> On Jul 15, 2021, at 11:50 AM, Eli Zaretskii <eliz@gnu.org> wrote:
> 
>> From: Yuan Fu <casouri@gmail.com>
>> Date: Thu, 15 Jul 2021 11:17:02 -0400
>> Cc: Stefan Monnier <monnier@iro.umontreal.ca>,
>> emacs-devel@gnu.org
>> 
>>> Great, then please try also to liberate the implementation from using
>>> JSON, it's a major slowdown factor.
>> 
>> JSON? I didn’t write anything involving JSON. 
> 
> Then what is json-module.zip about?

That’s a language definition for tree-sitter, so it tells tree-sitter how to parse a JSON file. There are definitions for Python, Ruby, C, etc. I just used JSON for an example. It’s named json-module because it is a dynamic module.

> 
>> While you are looking at the patch, here are some questions for integrating tree-sitter with out buffer implementation. What I envisioned is for each buffer to have a `parser-list’, and on buffer change, we update each parser’s tree. I think modifying signal_after_change is enough to cover al the cases?
> 
> Why do you need to do this when a buffer is updated? why not use
> display as the trigger?  Large portions of a buffer will never be
> displayed, and some buffers will not be displayed at all.  Why waste
> cycles on them?  Redisplay is perfectly equipped to tell you when some
> chunk of buffer text is going to be redrawn, and it already knows to
> do nothing if the buffer haven't changed.

Tree-sitter expects you to tell it every single change to the parsed text. Say you have a buffer with some content and scrolled through it, so tree-sitter has parsed the whole buffer. Then some elisp edited some text outside the visible portion. Redisplay doesn’t happen, we don’t tell this edit to tree-sitter. Then I scroll to the place that has been edited. What now? I’ve lost the change information, and tree-sitter’s tree is out-dated.

We can fontify on-demand, but we can’t parse on-demand. What we can do is to only parse the portion from BOB to the visible portion. So we won’t parse the whole buffer unless you scroll to the bottom.

> 
>> And, for tree-sitter to take the buffer’s content directly, we need to tell it to skip the gap.
> 
> AFAIR, tree-sitter allows the calling package to provide a function to
> access the text, isn't that so?  If so, you could write a function
> that accesses buffer text via BYTE_POS_ADDR etc., and that knows how
> to skip the gap already.

Yes, that function returns a char*. But what if the gap is in the middle of the portion that tree-sitter wants to read? Alternatively, we can copy the text out and pass it to tree-sitter, but you don’t like that, IIRC.

> 
>> I only need to modify gap_left, gap_right, make_gap_smaller and make_gap_larger, right?
> 
> Why would you need to _modify_ any of these?

Because I want to let tree-sitter to know where is the gap so it can avoid it when reading text.

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-15 16:19                   ` Yuan Fu
@ 2021-07-15 16:26                     ` Yuan Fu
  2021-07-15 16:50                       ` Eli Zaretskii
  2021-07-15 16:48                     ` Eli Zaretskii
  1 sibling, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-07-15 16:26 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Stefan Monnier, emacs-devel

[-- Attachment #1: Type: text/plain, Size: 830 bytes --]

> 
>> 
>>> And, for tree-sitter to take the buffer’s content directly, we need to tell it to skip the gap.
>> 
>> AFAIR, tree-sitter allows the calling package to provide a function to
>> access the text, isn't that so?  If so, you could write a function
>> that accesses buffer text via BYTE_POS_ADDR etc., and that knows how
>> to skip the gap already.
> 
> Yes, that function returns a char*. But what if the gap is in the middle of the portion that tree-sitter wants to read? Alternatively, we can copy the text out and pass it to tree-sitter, but you don’t like that, IIRC.

Or we can only copy out when the portion tree-sitter wants encompasses the gap, I expect this case to be relatively rare so we won’t copy out all the time, and most of the time tree-sitter just reads from the buffer directly.

Yuan

[-- Attachment #2: Type: text/html, Size: 3096 bytes --]

^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-15 16:19                   ` Yuan Fu
  2021-07-15 16:26                     ` Yuan Fu
@ 2021-07-15 16:48                     ` Eli Zaretskii
  2021-07-15 18:23                       ` Yuan Fu
  2021-07-20 16:25                       ` Stephen Leake
  1 sibling, 2 replies; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-15 16:48 UTC (permalink / raw)
  To: Yuan Fu; +Cc: monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Thu, 15 Jul 2021 12:19:31 -0400
> Cc: monnier@iro.umontreal.ca,
>  emacs-devel@gnu.org
> 
> > Why do you need to do this when a buffer is updated? why not use
> > display as the trigger?  Large portions of a buffer will never be
> > displayed, and some buffers will not be displayed at all.  Why waste
> > cycles on them?  Redisplay is perfectly equipped to tell you when some
> > chunk of buffer text is going to be redrawn, and it already knows to
> > do nothing if the buffer haven't changed.
> 
> Tree-sitter expects you to tell it every single change to the parsed text.

That cannot be true, because the parsed text could be in a state where
parsing it will fail.  When you are in the middle of writing the code,
this is what will happen many times, even if you pass the whole buffer
to the parser.  And since tree-sitter _must_ be able to deal with this
problem, it also must be able to receive incomplete parts of the
buffer text, and do the best it can with it.

> Say you have a buffer with some content and scrolled through it, so tree-sitter has parsed the whole buffer. Then some elisp edited some text outside the visible portion. Redisplay doesn’t happen, we don’t tell this edit to tree-sitter. Then I scroll to the place that has been edited. What now?

Now you call tree-sitter passing it the part of the buffer that needs
to be parsed (e.g., the chunk that is about to be displayed).  If
tree-sitter needs to look back, it will.

> I’ve lost the change information, and tree-sitter’s tree is out-dated.

No information is lost because the updated buffer text is available.

> We can fontify on-demand, but we can’t parse on-demand.

Sorry, I don't believe this is true.  tree-sitter _must_ be able to
deal with these situations, because it must be able to deal with
incomplete text that cannot be parsed without parse errors.

In addition, Emacs records (for redisplay purposes) two places in each
buffer related to changes: the minimum buffer position before which no
changes were done since last redisplay, and the maximum buffer
position beyond which there were no changes.  This can also be used to
pass only a small part of the buffer to the parser, because the rest
didn't change.

> What we can do is to only parse the portion from BOB to the visible portion. So we won’t parse the whole buffer unless you scroll to the bottom.

My primary worry is the fact that you want to use buffer-change hooks
(and will soon enough want to use post-command-hook as well).  They
slow down editing, sometimes tremendously, so I'd very much prefer not
to use those hooks for fontification/parsing.  The original font-lock
mechanism in Emacs 19 used these hooks; we switched to jit-lock and
its redisplay-triggered fontifications because the original design had
problems which couldn't be solved reliably and with reasonable
performance.  I hope we will not make the mistake of going back to
that sub-optimal design.

> >> And, for tree-sitter to take the buffer’s content directly, we need to tell it to skip the gap.
> > 
> > AFAIR, tree-sitter allows the calling package to provide a function to
> > access the text, isn't that so?  If so, you could write a function
> > that accesses buffer text via BYTE_POS_ADDR etc., and that knows how
> > to skip the gap already.
> 
> Yes, that function returns a char*. But what if the gap is in the middle of the portion that tree-sitter wants to read?

If you provide the function that returns text one character at a time,
as AFAIR tree-sitter allows, you will be able to skip the gap
automagically by using BYTE_POS_ADDR.  If that's not possible for some
reason, or not performant enough, we could ask tree-sitter developers
to add an API that access buffer text in two chunks, in which case it
will be called first with text before the gap, and then with text
after the gap.  Like we do when we call regex search functions.

> Alternatively, we can copy the text out and pass it to tree-sitter, but you don’t like that, IIRC.

Yes, because it means memory allocation, which could be slow,
especially for large buffers.  It could even fail if the buffer is
large enough and the system is under memory pressure.

> >> I only need to modify gap_left, gap_right, make_gap_smaller and make_gap_larger, right?
> > 
> > Why would you need to _modify_ any of these?
> 
> Because I want to let tree-sitter to know where is the gap so it can avoid it when reading text.

Knowing where is the gap doesn't need any changes to these functions.
See GPT_BYTE, GPT_SIZE, BUF_GPT_BYTE, and BUF_GPT_SIZE.  And the gap
cannot move while tree-sitter accesses the buffer, because no other
part of the Lisp machine can run at that time.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-15 16:26                     ` Yuan Fu
@ 2021-07-15 16:50                       ` Eli Zaretskii
  0 siblings, 0 replies; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-15 16:50 UTC (permalink / raw)
  To: Yuan Fu; +Cc: monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Thu, 15 Jul 2021 12:26:25 -0400
> Cc: Stefan Monnier <monnier@iro.umontreal.ca>,
>  emacs-devel@gnu.org
> 
> Or we can only copy out when the portion tree-sitter wants encompasses the gap, I expect this case to be
> relatively rare so we won’t copy out all the time, and most of the time tree-sitter just reads from the buffer
> directly.

Actually, I expect this to happen quite frequently, because the gap is
usually where the editing happens.

We could, of course, move the gap out of the way temporarily, but
that's somewhat expensive, so it is better to avoid it.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-15 16:48                     ` Eli Zaretskii
@ 2021-07-15 18:23                       ` Yuan Fu
  2021-07-16  7:30                         ` Eli Zaretskii
  2021-07-20 16:27                         ` Stephen Leake
  2021-07-20 16:25                       ` Stephen Leake
  1 sibling, 2 replies; 284+ messages in thread
From: Yuan Fu @ 2021-07-15 18:23 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Stefan Monnier, emacs-devel



> On Jul 15, 2021, at 12:48 PM, Eli Zaretskii <eliz@gnu.org> wrote:
> 
>> From: Yuan Fu <casouri@gmail.com>
>> Date: Thu, 15 Jul 2021 12:19:31 -0400
>> Cc: monnier@iro.umontreal.ca,
>> emacs-devel@gnu.org
>> 
>>> Why do you need to do this when a buffer is updated? why not use
>>> display as the trigger?  Large portions of a buffer will never be
>>> displayed, and some buffers will not be displayed at all.  Why waste
>>> cycles on them?  Redisplay is perfectly equipped to tell you when some
>>> chunk of buffer text is going to be redrawn, and it already knows to
>>> do nothing if the buffer haven't changed.
>> 
>> Tree-sitter expects you to tell it every single change to the parsed text.
> 
> That cannot be true, because the parsed text could be in a state where
> parsing it will fail.  When you are in the middle of writing the code,
> this is what will happen many times, even if you pass the whole buffer
> to the parser.  And since tree-sitter _must_ be able to deal with this
> problem, it also must be able to receive incomplete parts of the
> buffer text, and do the best it can with it.
> 
>> Say you have a buffer with some content and scrolled through it, so tree-sitter has parsed the whole buffer. Then some elisp edited some text outside the visible portion. Redisplay doesn’t happen, we don’t tell this edit to tree-sitter. Then I scroll to the place that has been edited. What now?
> 
> Now you call tree-sitter passing it the part of the buffer that needs
> to be parsed (e.g., the chunk that is about to be displayed).  If
> tree-sitter needs to look back, it will.
> 
>> I’ve lost the change information, and tree-sitter’s tree is out-dated.
> 
> No information is lost because the updated buffer text is available.
> 
>> We can fontify on-demand, but we can’t parse on-demand.
> 
> Sorry, I don't believe this is true.  tree-sitter _must_ be able to
> deal with these situations, because it must be able to deal with
> incomplete text that cannot be parsed without parse errors.
> 
I think my assertion was too strong. By “can’t parse on-demand” I mean we can’t easily pass tree-sitter a random chunk of text and not letting it to parse from BOB. 

> In addition, Emacs records (for redisplay purposes) two places in each
> buffer related to changes: the minimum buffer position before which no
> changes were done since last redisplay, and the maximum buffer
> position beyond which there were no changes.  This can also be used to
> pass only a small part of the buffer to the parser, because the rest
> didn't change.
> 
>> What we can do is to only parse the portion from BOB to the visible portion. So we won’t parse the whole buffer unless you scroll to the bottom.
> 
> My primary worry is the fact that you want to use buffer-change hooks
> (and will soon enough want to use post-command-hook as well).  They
> slow down editing, sometimes tremendously, so I'd very much prefer not
> to use those hooks for fontification/parsing.  The original font-lock
> mechanism in Emacs 19 used these hooks; we switched to jit-lock and
> its redisplay-triggered fontifications because the original design had
> problems which couldn't be solved reliably and with reasonable
> performance.  I hope we will not make the mistake of going back to
> that sub-optimal design.

I understand. I want to point out that parsing is separated from fontification, and syntax-pass flushes its cache in before-change-hook. I was hoping to use the parse tree for more than fontification, e.g., motion commands like sexp-forward/backward or structural editing commands like expand-region. Another scenario: some elisp edited some text before the visible portion, the tree is not updated, now I want to select the node at point (like expand-region), I look for the leave node that contains the byte position of point. However, because the tree is out-dated, the byte position of point will not correspond to the node I want.

We can still fontify with jit-lock, it’s just parsing cannot easily work like fontification, I expect tree-sitter to work similarly to syntax-pass rather than jit-lock.

> 
>>>> And, for tree-sitter to take the buffer’s content directly, we need to tell it to skip the gap.
>>> 
>>> AFAIR, tree-sitter allows the calling package to provide a function to
>>> access the text, isn't that so?  If so, you could write a function
>>> that accesses buffer text via BYTE_POS_ADDR etc., and that knows how
>>> to skip the gap already.
>> 
>> Yes, that function returns a char*. But what if the gap is in the middle of the portion that tree-sitter wants to read?
> 
> If you provide the function that returns text one character at a time,
> as AFAIR tree-sitter allows, you will be able to skip the gap
> automagically by using BYTE_POS_ADDR.  If that's not possible for some
> reason, or not performant enough, we could ask tree-sitter developers
> to add an API that access buffer text in two chunks, in which case it
> will be called first with text before the gap, and then with text
> after the gap.  Like we do when we call regex search functions.

Yes, I make a mistake reading the api. Indeed we can read one character at a time, and gap is not an issue anymore.

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-15 18:23                       ` Yuan Fu
@ 2021-07-16  7:30                         ` Eli Zaretskii
  2021-07-16 14:27                           ` Yuan Fu
  2021-07-20 16:28                           ` Stephen Leake
  2021-07-20 16:27                         ` Stephen Leake
  1 sibling, 2 replies; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-16  7:30 UTC (permalink / raw)
  To: Yuan Fu; +Cc: monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Thu, 15 Jul 2021 14:23:02 -0400
> Cc: Stefan Monnier <monnier@iro.umontreal.ca>,
>  emacs-devel@gnu.org
> 
> >> Say you have a buffer with some content and scrolled through it, so tree-sitter has parsed the whole buffer. Then some elisp edited some text outside the visible portion. Redisplay doesn’t happen, we don’t tell this edit to tree-sitter. Then I scroll to the place that has been edited. What now?
> > 
> > Now you call tree-sitter passing it the part of the buffer that needs
> > to be parsed (e.g., the chunk that is about to be displayed).  If
> > tree-sitter needs to look back, it will.
> > 
> >> I’ve lost the change information, and tree-sitter’s tree is out-dated.
> > 
> > No information is lost because the updated buffer text is available.
> > 
> >> We can fontify on-demand, but we can’t parse on-demand.
> > 
> > Sorry, I don't believe this is true.  tree-sitter _must_ be able to
> > deal with these situations, because it must be able to deal with
> > incomplete text that cannot be parsed without parse errors.
> > 
> I think my assertion was too strong. By “can’t parse on-demand” I mean we can’t easily pass tree-sitter a random chunk of text and not letting it to parse from BOB. 

You must start from BOB only in languages that require that; not every
language does.

And even with languages that require starting from BOB, you could do
that only once, the first time a buffer needs parsing; thereafter, you
can only pass to tree-sitter the parts that were changed since the
last time.  Emacs records that information for the display engine, see
BEG_UNCHANGED and END_UNCHANGED.  If that is not enough, we could
record more information about changes to buffer text.

The main issue here is to pass the buffer text to tree-sitter lazily,
only when and as much as needed.

> I understand. I want to point out that parsing is separated from fontification, and syntax-pass flushes its cache in before-change-hook. I was hoping to use the parse tree for more than fontification, e.g., motion commands like sexp-forward/backward or structural editing commands like expand-region. Another scenario: some elisp edited some text before the visible portion, the tree is not updated, now I want to select the node at point (like expand-region), I look for the leave node that contains the byte position of point. However, because the tree is out-dated, the byte position of point will not correspond to the node I want.

Each command/feature that needs an updated TS tree will take care of
updating TS with the relevant information.  We should record whatever
we need for that as side effect of primitives that change buffer text
(in insdel.c), and use the recorded info to update TS.  But the actual
passing of text to TS should happen lazily, when we actually need its
re-parsing, not when the changes to buffer text are done.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-16  7:30                         ` Eli Zaretskii
@ 2021-07-16 14:27                           ` Yuan Fu
  2021-07-16 14:33                             ` Stefan Monnier
  2021-07-16 15:27                             ` Eli Zaretskii
  2021-07-20 16:28                           ` Stephen Leake
  1 sibling, 2 replies; 284+ messages in thread
From: Yuan Fu @ 2021-07-16 14:27 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Stefan Monnier, emacs-devel

> 
> Each command/feature that needs an updated TS tree will take care of
> updating TS with the relevant information.  We should record whatever
> we need for that as side effect of primitives that change buffer text
> (in insdel.c), and use the recorded info to update TS.  But the actual
> passing of text to TS should happen lazily, when we actually need its
> re-parsing, not when the changes to buffer text are done.

Ok, I will write it like that. Another question, how do I add a new field in struct buffer? I tried to add

	Lisp_Object ts_parser_list_;

Before 

	Lisp_Object cursor_in_non_selected_windows_;

But that wouldn't dump.

I want to put the parsers in a field rather than in a buffer local variable because I don’t want users to add/remove parsers from this list freely, otherwise the parsers could go out of sync. I plan to provide functions like add-parser, remove-parser, buffer-parser-list for users to access this list.

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-16 14:27                           ` Yuan Fu
@ 2021-07-16 14:33                             ` Stefan Monnier
  2021-07-16 14:53                               ` Yuan Fu
  2021-07-16 15:27                             ` Eli Zaretskii
  1 sibling, 1 reply; 284+ messages in thread
From: Stefan Monnier @ 2021-07-16 14:33 UTC (permalink / raw)
  To: Yuan Fu; +Cc: Eli Zaretskii, emacs-devel

> I want to put the parsers in a field rather than in a buffer local variable
> because I don’t want users to add/remove parsers from this list freely,
> otherwise the parsers could go out of sync.

I wouldn't worry 'bout that: Emacs generally doesn't try to stop people
shooting themselves in the foot.  So we want to provide a convenient and
safe API but we don't have to hide its inner workings.


        Stefan




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-16 14:33                             ` Stefan Monnier
@ 2021-07-16 14:53                               ` Yuan Fu
  0 siblings, 0 replies; 284+ messages in thread
From: Yuan Fu @ 2021-07-16 14:53 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Eli Zaretskii, emacs-devel



> On Jul 16, 2021, at 10:33 AM, Stefan Monnier <monnier@iro.umontreal.ca> wrote:
> 
>> I want to put the parsers in a field rather than in a buffer local variable
>> because I don’t want users to add/remove parsers from this list freely,
>> otherwise the parsers could go out of sync.
> 
> I wouldn't worry 'bout that: Emacs generally doesn't try to stop people
> shooting themselves in the foot.  

I should’ve figured that out by now ;-)

> So we want to provide a convenient and
> safe API but we don't have to hide its inner workings.
> 

Ok, local variable then.

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-16 14:27                           ` Yuan Fu
  2021-07-16 14:33                             ` Stefan Monnier
@ 2021-07-16 15:27                             ` Eli Zaretskii
  2021-07-16 15:51                               ` Yuan Fu
  1 sibling, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-16 15:27 UTC (permalink / raw)
  To: Yuan Fu; +Cc: monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Fri, 16 Jul 2021 10:27:36 -0400
> Cc: Stefan Monnier <monnier@iro.umontreal.ca>,
>  emacs-devel@gnu.org
> 
> Another question, how do I add a new field in struct buffer? I tried to add
> 
> 	Lisp_Object ts_parser_list_;
> 
> Before 
> 
> 	Lisp_Object cursor_in_non_selected_windows_;
> 
> But that wouldn't dump.

Did you see in init_buffer_once what we do with built-in fields of
struct buffer?



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-16 15:27                             ` Eli Zaretskii
@ 2021-07-16 15:51                               ` Yuan Fu
  2021-07-17  2:05                                 ` Yuan Fu
  0 siblings, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-07-16 15:51 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Stefan Monnier, emacs-devel



> On Jul 16, 2021, at 11:27 AM, Eli Zaretskii <eliz@gnu.org> wrote:
> 
>> From: Yuan Fu <casouri@gmail.com>
>> Date: Fri, 16 Jul 2021 10:27:36 -0400
>> Cc: Stefan Monnier <monnier@iro.umontreal.ca>,
>> emacs-devel@gnu.org
>> 
>> Another question, how do I add a new field in struct buffer? I tried to add
>> 
>> 	Lisp_Object ts_parser_list_;
>> 
>> Before 
>> 
>> 	Lisp_Object cursor_in_non_selected_windows_;
>> 
>> But that wouldn't dump.
> 
> Did you see in init_buffer_once what we do with built-in fields of
> struct buffer?

I did not, that must be why, thanks. Though I’ve changed to use a buffer-local variable as Stefan suggested.

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-16 15:51                               ` Yuan Fu
@ 2021-07-17  2:05                                 ` Yuan Fu
  2021-07-17  2:23                                   ` Clément Pit-Claudel
  2021-07-17  6:56                                   ` Eli Zaretskii
  0 siblings, 2 replies; 284+ messages in thread
From: Yuan Fu @ 2021-07-17  2:05 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Stefan Monnier, emacs-devel

[-- Attachment #1: Type: text/plain, Size: 2008 bytes --]

Please have a look at the second patch that applies on top of the first one. This time I added after-change hooks, so if you create a parser for a buffer and edit that buffer, the parser is kept updated lazily.

In summary, the parser parses the whole buffer on the first time when the user asks for the parse tree. In after-change-hook, no parsing is done, but we do update the trees with position changes. On the next time when the user asks for the parse tree, the whole buffer is re-parsed incrementally. (I didn’t read the paper, but I assume it knows where are the bits to re-parse because we updated the tree with position changes.)

Maybe this is not lazy enough, and I should do a benchmark. This is a simple benchmark that I did:

Benchmark 1: 22M json file, opened in literary mode, try parse the whole buffer, took 17s and uses 3G memory.

Benchmark2: 1.6M json file, opened in fundamental mode, first parsed the whole buffer, took 1.039s, no gc. Then  ran this:

(benchmark-run 1000
  (dotimes (_ 1000)
    (insert
     "1,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,\n"))
  (dotimes (_ 1000)
    (backward-delete-char
     (length
      "1,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,\n"))))

Result: (39.302071 8 4.3011029999999995) and many gc trimming. Then removes the parser, ran again,
Result: (33.589416 8 4.405495999999999)

No parsing is done in either run (because parsing is lazy, and I didn’t ask for the parse tree). The only difference is that, in the first run, after-change-hook updates the tree with position change. My conclusion is that after-change-hook is pretty insignificant, and the initial parse is a bit slow (on large files).

I’m running this on a 1.4 GHz Quad-Core Intel Core i5 with 16G memory.

Of course, I’m open to suggestions for a better benchmark. The amateur log of the benchmark is in benchmark.el. The json file I used in the second benchmark is benchmark.2.json. The patch is ts.2.patch.


[-- Attachment #2: ts.2.patch --]
[-- Type: application/octet-stream, Size: 11087 bytes --]

From 180aea41cdce11b9b4bdc7da0964c14c0bf8a5f0 Mon Sep 17 00:00:00 2001
From: Yuan Fu <casouri@gmail.com>
Date: Fri, 16 Jul 2021 21:11:29 -0400
Subject: [PATCH] checkpoint 2: add change-hooks

---
 src/insdel.c      |  16 +++++
 src/tree_sitter.c | 163 ++++++++++++++++++++++++++++++++++++++++++++--
 src/tree_sitter.h |  10 +++
 3 files changed, 182 insertions(+), 7 deletions(-)

diff --git a/src/insdel.c b/src/insdel.c
index e38b091f54..3c1e13d38b 100644
--- a/src/insdel.c
+++ b/src/insdel.c
@@ -31,6 +31,10 @@
 #include "region-cache.h"
 #include "pdumper.h"
 
+#ifdef HAVE_TREE_SITTER
+#include "tree_sitter.h"
+#endif
+
 static void insert_from_string_1 (Lisp_Object, ptrdiff_t, ptrdiff_t, ptrdiff_t,
 				  ptrdiff_t, bool, bool);
 static void insert_from_buffer_1 (struct buffer *, ptrdiff_t, ptrdiff_t, bool);
@@ -2152,6 +2156,11 @@ signal_before_change (ptrdiff_t start_int, ptrdiff_t end_int,
       run_hook (Qfirst_change_hook);
     }
 
+#ifdef HAVE_TREE_SITTER
+  /* FIXME: Is this the best place?  */
+  ts_before_change (start_int, end_int);
+#endif
+
   /* Now run the before-change-functions if any.  */
   if (!NILP (Vbefore_change_functions))
     {
@@ -2205,6 +2214,13 @@ signal_after_change (ptrdiff_t charpos, ptrdiff_t lendel, ptrdiff_t lenins)
   if (inhibit_modification_hooks)
     return;
 
+#ifdef HAVE_TREE_SITTER
+  /* We disrespect combine-after-change, because if we don't record
+     this change, the information that we need (the end byte position
+     of the change) will be lost.  */
+  ts_after_change (charpos, lendel, lenins);
+#endif
+
   /* If we are deferring calls to the after-change functions
      and there are no before-change functions,
      just record the args that we were going to use.  */
diff --git a/src/tree_sitter.c b/src/tree_sitter.c
index f2134c571a..7d1225161c 100644
--- a/src/tree_sitter.c
+++ b/src/tree_sitter.c
@@ -27,6 +27,7 @@ Copyright (C) 2021 Free Software Foundation, Inc.
 #include <stdlib.h>
 #include <unistd.h>
 
+#include "lisp.h"
 #include "buffer.h"
 #include "coding.h"
 #include "tree_sitter.h"
@@ -34,6 +35,98 @@ Copyright (C) 2021 Free Software Foundation, Inc.
 /* parser.h defines a macro ADVANCE that conflicts with alloc.c.   */
 #include <tree_sitter/parser.h>
 
+/* Record the byte position of the end of the (to-be) changed text.
+We have to record it now, because by the time we get to after-change
+hook, the _byte_ position of the end is lost.  */
+void
+ts_before_change (ptrdiff_t start_int, ptrdiff_t end_int)
+{
+  /* Iterate through each parser in 'tree-sitter-parser-list' and
+     record the byte position.  There could be better ways to record
+     it than storing the same position in every parser, but this is
+     the most fool-proof way, and I expect a buffer to have only one
+     parser most of the time anyway. */
+  ptrdiff_t beg_byte = CHAR_TO_BYTE (start_int);
+  ptrdiff_t old_end_byte = CHAR_TO_BYTE (end_int);
+  Lisp_Object parser_list = Fsymbol_value (Qtree_sitter_parser_list);
+  while (!NILP (parser_list))
+    {
+      Lisp_Object lisp_parser = Fcar (parser_list);
+      XTS_PARSER (lisp_parser)->edit.start_byte = beg_byte;
+      XTS_PARSER (lisp_parser)->edit.old_end_byte = old_end_byte;
+      parser_list = Fcdr (parser_list);
+    }
+}
+
+/* Update each parser's tree after the user made an edit.  This
+function does not parse the buffer and only updates the tree. (So it
+should be very fast.)  */
+void
+ts_after_change (ptrdiff_t charpos, ptrdiff_t lendel, ptrdiff_t lenins)
+{
+  ptrdiff_t new_end_byte = CHAR_TO_BYTE (charpos + lenins);
+  Lisp_Object parser_list = Fsymbol_value (Qtree_sitter_parser_list);
+  while (!NILP (parser_list))
+    {
+      Lisp_Object lisp_parser = Fcar (parser_list);
+      TSTree *tree = XTS_PARSER (lisp_parser)->tree;
+      XTS_PARSER (lisp_parser)->edit.new_end_byte = new_end_byte;
+      if (tree != NULL)
+	  ts_tree_edit (tree, &XTS_PARSER (lisp_parser)->edit);
+      parser_list = Fcdr (parser_list);
+    }
+}
+
+/* Parse the buffer.  We don't parse until we have to. When we have
+to, we call this function to parse and update the tree.  */
+void
+ts_ensure_parsed (Lisp_Object parser)
+{
+  TSParser *ts_parser = XTS_PARSER (parser)->parser;
+  TSTree *tree = XTS_PARSER(parser)->tree;
+  TSInput input = XTS_PARSER (parser)->input;
+  TSTree *new_tree = ts_parser_parse(ts_parser, tree, input);
+  XTS_PARSER (parser)->tree = new_tree;
+}
+
+/* This is the read function provided to tree-sitter to read from a
+   buffer.  It reads one character at a time and automatically skip
+   the gap.  */
+const char*
+ts_read_buffer (void *buffer, uint32_t byte_index,
+		TSPoint position, uint32_t *bytes_read)
+{
+  if (! BUFFER_LIVE_P ((struct buffer *) buffer))
+    error ("BUFFER is not live");
+
+  ptrdiff_t byte_pos = byte_index + 1;
+
+  // FIXME: Add some boundary checks?
+  /* I believe we can get away with only setting current-buffer
+     and not actually switching to it, like what we did in
+     'make_gap_1'.  */
+  struct buffer *old_buffer = current_buffer;
+  current_buffer = (struct buffer *) buffer;
+
+  /* Read one character.  */
+  char *beg;
+  int len;
+  if (byte_pos >= Z_BYTE)
+    {
+      beg = "";
+      len = 0;
+    }
+  else
+    {
+      beg = (char *) BYTE_POS_ADDR (byte_pos);
+      len = next_char_len(byte_pos);
+    }
+  *bytes_read = (uint32_t) len;
+  current_buffer = old_buffer;
+  return beg;
+}
+
+/* Wrap the parser in a Lisp_Object to be used in the Lisp machine.  */
 Lisp_Object
 make_ts_parser (struct buffer *buffer, TSParser *parser, TSTree *tree)
 {
@@ -42,10 +135,15 @@ make_ts_parser (struct buffer *buffer, TSParser *parser, TSTree *tree)
   lisp_parser->buffer = buffer;
   lisp_parser->parser = parser;
   lisp_parser->tree = tree;
-  // TODO TSInput.
+  TSInput input = {buffer, ts_read_buffer, TSInputEncodingUTF8};
+  lisp_parser->input = input;
+  TSPoint dummy_point = {0, 0};
+  TSInputEdit edit = {0, 0, 0, dummy_point, dummy_point, dummy_point};
+  lisp_parser->edit = edit;
   return make_lisp_ptr (lisp_parser, Lisp_Vectorlike);
 }
 
+/* Wrap the node in a Lisp_Object to be used in the Lisp machine.  */
 Lisp_Object
 make_ts_node (Lisp_Object parser, TSNode node)
 {
@@ -57,19 +155,59 @@ make_ts_node (Lisp_Object parser, TSNode node)
 }
 
 
-/* Tree-sitter parser.  */
+DEFUN ("tree-sitter-create-parser",
+       Ftree_sitter_create_parser, Stree_sitter_create_parser,
+       2, 2, 0,
+       doc: /* Create and return a parser in BUFFER for LANGUAGE.
+The parser is automatically added to BUFFER's
+`tree-sitter-parser-list'.  LANGUAGE should be the language provided
+by a tree-sitter language dynamic module.  */)
+  (Lisp_Object buffer, Lisp_Object language)
+{
+  CHECK_BUFFER(buffer);
+
+  /* LANGUAGE is a USER_PTR that contains the pointer to a TSLanguage
+     struct.  */
+  TSParser *parser = ts_parser_new ();
+  TSLanguage *lang = (XUSER_PTR (language)->p);
+  ts_parser_set_language (parser, lang);
+
+  Lisp_Object lisp_parser
+    = make_ts_parser (XBUFFER(buffer), parser, NULL);
+
+  // FIXME: Is this the correct way to set a buffer-local variable?
+  struct buffer *old_buffer = current_buffer;
+  set_buffer_internal (XBUFFER (buffer));
+
+  Fset (Qtree_sitter_parser_list,
+	Fcons (lisp_parser, Fsymbol_value (Qtree_sitter_parser_list)));
+
+  set_buffer_internal (old_buffer);
+  return lisp_parser;
+}
+
+DEFUN ("tree-sitter-parser-root-node",
+       Ftree_sitter_parser_root_node, Stree_sitter_parser_root_node,
+       1, 1, 0,
+       doc: /* Return the root node of PARSER.  */)
+  (Lisp_Object parser)
+{
+  ts_ensure_parsed(parser);
+  TSNode root_node = ts_tree_root_node (XTS_PARSER (parser)->tree);
+  return make_ts_node (parser, root_node);
+}
 
 DEFUN ("tree-sitter-parse", Ftree_sitter_parse, Stree_sitter_parse,
        2, 2, 0,
-       doc: /* Parse STRING and return a parser object.
+       doc: /* Parse STRING and return the root node.
 LANGUAGE should be the language provided by a tree-sitter language
 dynamic module.  */)
   (Lisp_Object string, Lisp_Object language)
 {
   CHECK_STRING (string);
 
-  /* LANGUAGE is a USER_PTR that contains the pointer to a
-     TSLanguage struct.  */
+  /* LANGUAGE is a USER_PTR that contains the pointer to a TSLanguage
+     struct.  */
   TSParser *parser = ts_parser_new ();
   TSLanguage *lang = (XUSER_PTR (language)->p);
   ts_parser_set_language (parser, lang);
@@ -104,7 +242,7 @@ DEFUN ("tree-sitter-node-string",
 DEFUN ("tree-sitter-node-parent",
        Ftree_sitter_node_parent, Stree_sitter_node_parent, 1, 1, 0,
        doc: /* Return the immediate parent of NODE.
-Return nil if couldn't find any.  */)
+Return nil if we couldn't find any.  */)
   (Lisp_Object node)
 {
   TSNode ts_node = XTS_NODE (node)->node;
@@ -119,7 +257,7 @@ DEFUN ("tree-sitter-node-parent",
 DEFUN ("tree-sitter-node-child",
        Ftree_sitter_node_child, Stree_sitter_node_child, 2, 2, 0,
        doc: /* Return the Nth child of NODE.
-Return nil if couldn't find any.  */)
+Return nil if we couldn't find any.  */)
   (Lisp_Object node, Lisp_Object n)
 {
   CHECK_INTEGER (n);
@@ -138,6 +276,17 @@ DEFUN ("tree-sitter-node-child",
 void
 syms_of_tree_sitter (void)
 {
+  DEFSYM (Qtree_sitter_parser_list, "tree-sitter-parser-list");
+  DEFVAR_LISP ("ts-parser-list", Vtree_sitter_parser_list,
+		     doc: /* A list of tree-sitter parsers.
+// TODO: more doc.
+If you removed a parser from this list, do not put it back in.  */);
+  Vtree_sitter_parser_list = Qnil;
+  Fmake_variable_buffer_local (Qtree_sitter_parser_list);
+
+
+  defsubr (&Stree_sitter_create_parser);
+  defsubr (&Stree_sitter_parser_root_node);
   defsubr (&Stree_sitter_parse);
   defsubr (&Stree_sitter_node_string);
   defsubr (&Stree_sitter_node_parent);
diff --git a/src/tree_sitter.h b/src/tree_sitter.h
index 3c9e03475f..0606f336cc 100644
--- a/src/tree_sitter.h
+++ b/src/tree_sitter.h
@@ -28,6 +28,8 @@ #define EMACS_TREE_SITTER_H
 
 INLINE_HEADER_BEGIN
 
+/* A wrapper for a tree-sitter parser, but also contains a parse tree
+   and other goodies for convenience.  */
 struct Lisp_TS_Parser
 {
   union vectorlike_header header;
@@ -35,8 +37,10 @@ #define EMACS_TREE_SITTER_H
   TSParser *parser;
   TSTree *tree;
   TSInput input;
+  TSInputEdit edit;
 };
 
+/* A wrapper around a tree-sitter node.  */
 struct Lisp_TS_Node
 {
   union vectorlike_header header;
@@ -74,6 +78,12 @@ XTS_NODE (Lisp_Object a)
   return XUNTAG (a, Lisp_Vectorlike, struct Lisp_TS_Node);
 }
 
+void
+ts_before_change (ptrdiff_t charpos, ptrdiff_t lendel);
+
+void
+ts_after_change (ptrdiff_t charpos, ptrdiff_t lendel, ptrdiff_t lenins);
+
 Lisp_Object
 make_ts_parser (struct buffer *buffer, TSParser *parser, TSTree *tree);
 
-- 
2.24.3 (Apple Git-128)


[-- Attachment #3: benchmark.2.json --]
[-- Type: application/json, Size: 1689073 bytes --]

[-- Attachment #4: benchmark.el --]
[-- Type: application/octet-stream, Size: 1134 bytes --]

checkpoint 2 - benchmark.1.json (22M) - open literally

(benchmark-run 10
  (tree-sitter-parser-root-node
   (tree-sitter-create-parser
    (current-buffer) (tree-sitter-json))))

RESULT: stuck, used all my memory (14G and still growing)

(benchmark-run 1
  (tree-sitter-parser-root-node
   (tree-sitter-create-parser
    (current-buffer) (tree-sitter-json))))

17s, 3G memory.

\f
checkpoint 2 - benchmark.2.json (1.6M) - fundamental-mode

(benchmark-run 1
  (tree-sitter-parser-root-node
   (tree-sitter-create-parser
    (current-buffer) (tree-sitter-json))))

(1.039289 0 0.0)

(benchmark-run 1000
  (dotimes (_ 1000)
    (insert
     "1,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,\n"))
  (dotimes (_ 1000)
    (backward-delete-char
     (length
      "1,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,\n"))))

With parser:    (39.302071 8 4.3011029999999995)
Without parser: (33.589416 8 4.405495999999999)

Note: Warning (undo): Buffer ‘benchmark.2.json’ undo info was
27188988 bytes long. The undo info was discarded because it
exceeded `undo-outer-limit'.

[-- Attachment #5: Type: text/plain, Size: 8 bytes --]



Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-17  2:05                                 ` Yuan Fu
@ 2021-07-17  2:23                                   ` Clément Pit-Claudel
  2021-07-17  3:12                                     ` Yuan Fu
                                                       ` (2 more replies)
  2021-07-17  6:56                                   ` Eli Zaretskii
  1 sibling, 3 replies; 284+ messages in thread
From: Clément Pit-Claudel @ 2021-07-17  2:23 UTC (permalink / raw)
  To: emacs-devel

On 7/16/21 10:05 PM, Yuan Fu wrote:
> My conclusion is that after-change-hook is pretty insignificant, and the initial parse is a bit slow (on large files).

I have no idea if it makes sense, but: does the initial parse need to be synchronous, or could you instead run the parsing in one thread, and the rest of Emacs in another? (I'm talking about concurrent execution, not cooperative threading).

In most cases there should be very limited contention, if at at all: in large buffers most of Emacs' activity will be focused on the (relatively few) characters around the gap, and most of the parser's activity will be reading from the buffer at other positions.  You do need to be careful to not read the garbage data from the gap, but otherwise seeing stale or even inconsistent data from the parser thread shouldn't be an issue, since tree-sitter is supposed to be robust to bad parses.

In fact, depending on how robust tree-sitter is, you might even be able to do the concurrency-control optimistically (parse everything up to close to the gap, check that the gap hasn't moved into the region that you read, and then resume reading or rollback).

Alternatively, maybe you could even do a full parse with minimal concurrency control: you'd make sure that the Emacs thread records not just changes to the buffer text but also movements of the gap, and then you could use that list of changes for the next parse?

Anyway, thanks for working on this!



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-17  2:23                                   ` Clément Pit-Claudel
@ 2021-07-17  3:12                                     ` Yuan Fu
  2021-07-17  7:18                                       ` Eli Zaretskii
  2021-07-17  7:16                                     ` Eli Zaretskii
  2021-07-17 17:30                                     ` Stefan Monnier
  2 siblings, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-07-17  3:12 UTC (permalink / raw)
  To: Clément Pit-Claudel; +Cc: emacs-devel



> On Jul 16, 2021, at 10:23 PM, Clément Pit-Claudel <cpitclaudel@gmail.com> wrote:
> 
> On 7/16/21 10:05 PM, Yuan Fu wrote:
>> My conclusion is that after-change-hook is pretty insignificant, and the initial parse is a bit slow (on large files).
> 
> I have no idea if it makes sense, but: does the initial parse need to be synchronous, or could you instead run the parsing in one thread, and the rest of Emacs in another? (I'm talking about concurrent execution, not cooperative threading).
> 
> In most cases there should be very limited contention, if at at all: in large buffers most of Emacs' activity will be focused on the (relatively few) characters around the gap, and most of the parser's activity will be reading from the buffer at other positions.  You do need to be careful to not read the garbage data from the gap, but otherwise seeing stale or even inconsistent data from the parser thread shouldn't be an issue, since tree-sitter is supposed to be robust to bad parses.
> 
> In fact, depending on how robust tree-sitter is, you might even be able to do the concurrency-control optimistically (parse everything up to close to the gap, check that the gap hasn't moved into the region that you read, and then resume reading or rollback).
> 
> Alternatively, maybe you could even do a full parse with minimal concurrency control: you'd make sure that the Emacs thread records not just changes to the buffer text but also movements of the gap, and then you could use that list of changes for the next parse?

Another way I thought about is to only “expose” the portion of buffer from BOB to some point to tree-sitter. And when a user asks for a parse tree, he also specifies to which point of the buffer he needs the parse tree for. For example, for fortification, jit-lock only needs the tree up to the end of the visible window. And for structure editing, asking for the portion up to window-end + a few thousand characters might be enough. However this heuristic could have problems in practice. (Maybe a giant comment section of thousands of characters follows, and instead of jumping to the end of it, we wrongly jump to middle of that comment section, because tree-sitter only “sees” to that point.) So I don’t know if it’s a good idea.

> 
> Anyway, thanks for working on this!
> 

I figure that this is low-tech enough that an amateur like me could possibly do it ;-)

Yuan




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-17  2:05                                 ` Yuan Fu
  2021-07-17  2:23                                   ` Clément Pit-Claudel
@ 2021-07-17  6:56                                   ` Eli Zaretskii
  1 sibling, 0 replies; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-17  6:56 UTC (permalink / raw)
  To: Yuan Fu; +Cc: monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Fri, 16 Jul 2021 22:05:01 -0400
> Cc: Stefan Monnier <monnier@iro.umontreal.ca>,
>  emacs-devel@gnu.org
> 
> Please have a look at the second patch that applies on top of the first one. This time I added after-change hooks, so if you create a parser for a buffer and edit that buffer, the parser is kept updated lazily.

Instead of using the hook machinery, it is better to augment the
insdel.c functions to directly update the information you need to
keep.  If you do it directly in the insdel.c functions that modify the
buffer text, you can be much more accurate in updating the information
about the changes, because each insdel.c function performs a
well-defined operation of the buffer text.  By contrast, buffer-change
hooks are higher-level functionality, meant for Lisp programs, so they
don't necessarily make it easy to reverse-engineer the specific
changes.  All you have there is some higher-level information about
which part of the buffer changed.  Moreover, the hooks are sometimes
called more times than they should be, to be on the safe side.

As a trivial (but not insignificant!) optimization, primitive insdel.c
functions always know both the character and the byte positions in the
buffer they change, so this code you needed in your hooks:

> +  ptrdiff_t beg_byte = CHAR_TO_BYTE (start_int);
> +  ptrdiff_t old_end_byte = CHAR_TO_BYTE (end_int);

could be avoided.  Converting character to byte positions can
sometimes significantly slow down the code, for example if there are a
lot of markers in the buffer.

So I urge you to record the change information directly in the
primitive functions of insdel.c, not in the hooks.

> In summary, the parser parses the whole buffer on the first time when the user asks for the parse tree. In after-change-hook, no parsing is done, but we do update the trees with position changes. On the next time when the user asks for the parse tree, the whole buffer is re-parsed incrementally. (I didn’t read the paper, but I assume it knows where are the bits to re-parse because we updated the tree with position changes.)

Why do you update the entire parser list for every modification?  This
comment:

> +void
> +ts_before_change (ptrdiff_t start_int, ptrdiff_t end_int)
> +{
> +  /* Iterate through each parser in 'tree-sitter-parser-list' and
> +     record the byte position.  There could be better ways to record
> +     it than storing the same position in every parser, but this is
> +     the most fool-proof way, and I expect a buffer to have only one
> +     parser most of the time anyway. */

already says that there are better ways: record the change info just
once in one place.  Then propagate that to the entire list, if you
need, when you actually need to call TS.  It is possible that at the
call time you will know which parser needs to be called, and will be
able to update only that parser, not the entire list.

> +  // FIXME: Add some boundary checks?
> +  /* I believe we can get away with only setting current-buffer
> +     and not actually switching to it, like what we did in
> +     'make_gap_1'.  */
> +  struct buffer *old_buffer = current_buffer;
> +  current_buffer = (struct buffer *) buffer;

This looks unnecessary: we have BUF_BYTE_ADDRESS, which accepts the
buffer as its argument, and the corresponding buf_next_char_len.  IOW,
why did you need to switch to the buffer?

> +  /* Read one character.  */
> +  char *beg;
> +  int len;
> +  if (byte_pos >= Z_BYTE)
> +    {
> +      beg = "";
> +      len = 0;
> +    }

Is getting an empty string what TS wants when it attempts to read
beyond EOB?

Also, why do you test Z_BYTE and not ZV_BYTE (actually, BUF_ZV_BYTE)?
Emacs in general behaves as if text beyond point-max didn't exist, why
should code supported by the TS parser behave differently?

> +      beg = (char *) BYTE_POS_ADDR (byte_pos);
> +      len = next_char_len(byte_pos);

This is sub-optimal: next_char_len also calls BYTE_POS_ADDR.  Why not
use BYTES_BY_CHAR_HEAD instead?

Thanks for working on this.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-17  2:23                                   ` Clément Pit-Claudel
  2021-07-17  3:12                                     ` Yuan Fu
@ 2021-07-17  7:16                                     ` Eli Zaretskii
  2021-07-20 20:36                                       ` Clément Pit-Claudel
  2021-07-17 17:30                                     ` Stefan Monnier
  2 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-17  7:16 UTC (permalink / raw)
  To: Clément Pit-Claudel; +Cc: emacs-devel

> From: Clément Pit-Claudel <cpitclaudel@gmail.com>
> Date: Fri, 16 Jul 2021 22:23:26 -0400
> 
> On 7/16/21 10:05 PM, Yuan Fu wrote:
> > My conclusion is that after-change-hook is pretty insignificant, and the initial parse is a bit slow (on large files).
> 
> I have no idea if it makes sense, but: does the initial parse need to be synchronous, or could you instead run the parsing in one thread, and the rest of Emacs in another? (I'm talking about concurrent execution, not cooperative threading).

You cannot have a thread freely accessing buffer text when the Lisp
machine is allowed to run concurrently with this, because the Lisp
machine can change the buffer text.

> In most cases there should be very limited contention, if at at all: in large buffers most of Emacs' activity will be focused on the (relatively few) characters around the gap, and most of the parser's activity will be reading from the buffer at other positions.

When Emacs moves or enlarges/shrinks the gap, that affects the entire
buffer text after the gap, regardless of where the gap is.  So it will
affect the TS reader if it reads stuff after the gap.

> You do need to be careful to not read the garbage data from the gap, but otherwise seeing stale or even inconsistent data from the parser thread shouldn't be an issue, since tree-sitter is supposed to be robust to bad parses.

What would be the purpose of calling the parser if we know in advance
it will fail when it gets to the "garbage" caused by async access to
the buffer text?

And besides, current Emacs primitives that access buffer text don't
necessarily do that atomically, since the assumption built into their
design is that no one should access that text at the same time.  So
you could have windows where the buffer text is in inconsistent state,
like if the gap was moved, but the variables which tell where the gap
is were not yet updated, or windows where a multibyte character was
not yet completely written or deleted to/from the buffer, resulting in
invalid multibyte sequences and inconsistent values of EOB.

So I don't see how this could be done without some inter-locking.

And what do you want the code which requested parsing do while the
parse thread runs?  The requesting code is in the main thread, so if
it just waits, you don't gain anything.

> In fact, depending on how robust tree-sitter is, you might even be able to do the concurrency-control optimistically (parse everything up to close to the gap, check that the gap hasn't moved into the region that you read, and then resume reading or rollback).

I don't understand what you suggest here.  For starters, the gap could
move (assuming you are still talking about a separate thread that does
the parsing), and what do we do then?

> Alternatively, maybe you could even do a full parse with minimal concurrency control: you'd make sure that the Emacs thread records not just changes to the buffer text but also movements of the gap, and then you could use that list of changes for the next parse?

I don't understand what could recording the gap solve.  The stuff in
the gap is generally garbage, and can easily include invalid multibyte
sequences.  I don't think it's a good idea to pass that to TS.  Also,
recording the gap changes in the main thread and accessing that
information from a concurrent thread again opens a window for races,
and requires synchronization.

Bottom line, I think what you are suggesting is premature
optimization: we don't yet know that we will need this.  If the TS
performance information is reliable, it should be fast enough for our
purposes; we just need to come up with an optimal way of calling it so
that we don't impose unnecessary delays.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-17  3:12                                     ` Yuan Fu
@ 2021-07-17  7:18                                       ` Eli Zaretskii
  0 siblings, 0 replies; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-17  7:18 UTC (permalink / raw)
  To: Yuan Fu; +Cc: cpitclaudel, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Fri, 16 Jul 2021 23:12:00 -0400
> Cc: emacs-devel@gnu.org
> 
> Another way I thought about is to only “expose” the portion of buffer from BOB to some point to tree-sitter. And when a user asks for a parse tree, he also specifies to which point of the buffer he needs the parse tree for. For example, for fortification, jit-lock only needs the tree up to the end of the visible window. And for structure editing, asking for the portion up to window-end + a few thousand characters might be enough.

Yes, I think we should only ask TS to parse what we need, not more.

> However this heuristic could have problems in practice. (Maybe a giant comment section of thousands of characters follows, and instead of jumping to the end of it, we wrongly jump to middle of that comment section, because tree-sitter only “sees” to that point.) So I don’t know if it’s a good idea.

It's definitely a good idea that should be pursued.  Even if in some
specific situation you'd need to pass to TS a large part of buffer
text, it will help in the other cases.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-17  2:23                                   ` Clément Pit-Claudel
  2021-07-17  3:12                                     ` Yuan Fu
  2021-07-17  7:16                                     ` Eli Zaretskii
@ 2021-07-17 17:30                                     ` Stefan Monnier
  2021-07-17 17:54                                       ` Eli Zaretskii
                                                         ` (2 more replies)
  2 siblings, 3 replies; 284+ messages in thread
From: Stefan Monnier @ 2021-07-17 17:30 UTC (permalink / raw)
  To: Clément Pit-Claudel; +Cc: emacs-devel

In your benchmark , you give numbers for:
- initial full-text parse (a bit above 1MB/s)
- cost of update-without-reparse

but I think it would be nice to see the cost of the reparse after
those updates (should be much faster than the initial parse).

Clément said:
> I have no idea if it makes sense, but: does the initial parse need to be
> synchronous, or could you instead run the parsing in one thread, and the
> rest of Emacs in another? (I'm talking about concurrent execution, not
> cooperative threading).

If we copy the buffer's content to a freshly malloc area before passing
that to TS, then there should be no problem running TS in a separate
concurrent thread, indeed.

Eli said:
> Why do you update the entire parser list for every modification?
> This comment:

If having multiple parsers in a single buffer is a not-uncommon case,
then indeed we'll need to do better, but if we assume this is an
anomalous situation, then Yuan's code is optimal ;-)

> Yes, I think we should only ask TS to parse what we need, not more.

We'll need to experiment with that.  Using an approach like
`syntax-ppss` where we only parse up to some high-watermark might be
a good approach, but it's also possible that it will work poorly: if TS
assumes it works on the whole buffer, then it will see the truncated
text as a syntax error and while it is supposed to handle syntax errors
nicely it may still lead to suboptimal behavior when parts of perfectly
valid code is misparsed because the parser was not allowed to see the
closing braces that make it "perfectly valid".


        Stefan




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-17 17:30                                     ` Stefan Monnier
@ 2021-07-17 17:54                                       ` Eli Zaretskii
  2021-07-24 14:08                                         ` Stefan Monnier
  2021-07-19 15:16                                       ` Yuan Fu
  2021-07-20 16:32                                       ` Stephen Leake
  2 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-17 17:54 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: cpitclaudel, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: emacs-devel@gnu.org
> Date: Sat, 17 Jul 2021 13:30:40 -0400
> 
> Clément said:
> > I have no idea if it makes sense, but: does the initial parse need to be
> > synchronous, or could you instead run the parsing in one thread, and the
> > rest of Emacs in another? (I'm talking about concurrent execution, not
> > cooperative threading).
> 
> If we copy the buffer's content to a freshly malloc area before passing
> that to TS, then there should be no problem running TS in a separate
> concurrent thread, indeed.

Making a copy of the buffer is a non-starter from where I stand.  It
doesn't scale, for starters.  I don't see any reason to go to such a
complex design at this early stage.

> Eli said:
> > Why do you update the entire parser list for every modification?
> > This comment:
> 
> If having multiple parsers in a single buffer is a not-uncommon case,
> then indeed we'll need to do better, but if we assume this is an
> anomalous situation, then Yuan's code is optimal ;-)
> 
> > Yes, I think we should only ask TS to parse what we need, not more.
> 
> We'll need to experiment with that.

We can experiment, but I think the basic design should be clean and
reasonable from the get-go.

> Using an approach like `syntax-ppss` where we only parse up to some
> high-watermark might be a good approach, but it's also possible that
> it will work poorly: if TS assumes it works on the whole buffer,
> then it will see the truncated text as a syntax error and while it
> is supposed to handle syntax errors nicely it may still lead to
> suboptimal behavior when parts of perfectly valid code is misparsed
> because the parser was not allowed to see the closing braces that
> make it "perfectly valid".

TS must be able to handle these situation well enough, because they
happen during editing all the time.  I wouldn't worry about that,
definitely not at this stage.

Different uses of the parse results will need to pass different chunks
of buffer text, and that is okay.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-17 17:30                                     ` Stefan Monnier
  2021-07-17 17:54                                       ` Eli Zaretskii
@ 2021-07-19 15:16                                       ` Yuan Fu
  2021-07-22  3:10                                         ` Yuan Fu
  2021-07-20 16:32                                       ` Stephen Leake
  2 siblings, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-07-19 15:16 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Clément Pit-Claudel, emacs-devel

[-- Attachment #1: Type: text/plain, Size: 1080 bytes --]


> On Jul 17, 2021, at 1:30 PM, Stefan Monnier <monnier@iro.umontreal.ca> wrote:
> 
> In your benchmark , you give numbers for:
> - initial full-text parse (a bit above 1MB/s)
> - cost of update-without-reparse
> 
> but I think it would be nice to see the cost of the reparse after
> those updates (should be much faster than the initial parse).

I have done some more benchmark. Initially I thought tree-sitter doesn’t scale, because re-parsing my JSON file is unexpectedly slow, but then I retired with xdisp.c with tree-sitter's C parser, and that is really fast and matches my expectation of tree-sitter. So from now on I’ll use xdispf.c and the C parser for benchmarking. I guess the json parser is simply bad-written?

I benchmarked with a simple C program. The programs are in main-c.c and main-json.c, and the shell output of the measurements is in benchmark.3.txt.

JSON: Initial parse takes 1.2s, re-parse (with no change) takes 0.7s, uses 307MB memory
C: Initial parse takes 0.14s, re-parse (with no change) takes 0.009s, uses 20MB memory

Yuan


[-- Attachment #2: benchmark.3.txt --]
[-- Type: text/plain, Size: 2875 bytes --]

On benchmark.2.json (1.6M)

One full parse: 1.2s
________________________________________________________
Executed in    1.30 secs   fish           external
   usr time  1210.81 millis  142.00 micros  1210.67 millis
   sys time   87.40 millis  756.00 micros   86.65 millis


One full parse and a re-parse:
________________________________________________________
Executed in    2.40 secs   fish           external
   usr time    1.95 secs  154.00 micros    1.95 secs
   sys time    0.15 secs  763.00 micros    0.15 secs

Re-parse takes 1.95 - 1.21 = 0.74s

Memory usage of full-parse + re-parse:

        2.17 real         2.00 user         0.16 sys
           307269632  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
               75035  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                 463  involuntary context switches
         14674957821  instructions retired
          7838514409  cycles elapsed
           306745344  peak memory footprint

307MB for two trees that "shares internal structure".

\f

On xdisp.c (1.2M)

One full paese: 0.139s
________________________________________________________
Executed in  478.23 millis    fish           external
   usr time  139.69 millis  134.00 micros  139.55 millis
   sys time    8.05 millis  829.00 micros    7.22 millis

Full parse and re-parse:
________________________________________________________
Executed in  456.58 millis    fish           external
   usr time  148.23 millis  153.00 micros  148.08 millis
   sys time    9.08 millis  791.00 micros    8.29 millis

148 - 139 = 0.009s

Memory usage of full-parse + re-parse:

        0.16 real         0.15 user         0.00 sys
            20131840  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                4932  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                  28  involuntary context switches
          1070525817  instructions retired
           581557699  cycles elapsed
            19271680  peak memory footprint

20MB

[-- Attachment #3: main-c.c --]
[-- Type: application/octet-stream, Size: 1180 bytes --]

#include <string.h>
#include <stdio.h>
#include <tree_sitter/api.h>

TSLanguage *tree_sitter_c();

struct buffer {
  char *buf;
  long len;
};

const char *read_file(void *payload, uint32_t byte_index,
                TSPoint position, uint32_t *bytes_read) {
  long len = ((struct buffer *) payload)->len;
  if (byte_index >= len) {
    *bytes_read = 0;
    return (char *) "";
  } else {
    *bytes_read = len - byte_index;
    return (char *) (((struct buffer *) payload)->buf) + byte_index;
  }
}

int main() {
  TSParser *parser = ts_parser_new();
  ts_parser_set_language(parser, tree_sitter_c());

  /* Copy the file into BUFFER. */
  FILE *file = fopen("xdisp.c", "rb");
  fseek(file, 0, SEEK_END);
  long length = ftell (file);
  fseek(file, 0, SEEK_SET);
  char *buffer = malloc (length);
  fread(buffer, 1, length, file);
  fclose (file);

  struct buffer buf = {buffer, length};
  TSInput input = {&buf, read_file, TSInputEncodingUTF8};
  
  TSTree *tree = ts_parser_parse(parser, NULL, input);
  TSTree *new_tree = ts_parser_parse(parser, tree, input);
  
  free(buffer);
  ts_tree_delete(tree);
  ts_tree_delete(new_tree);
  ts_parser_delete(parser);

  return 0;
}

[-- Attachment #4: main-json.c --]
[-- Type: application/octet-stream, Size: 1195 bytes --]

#include <string.h>
#include <stdio.h>
#include <tree_sitter/api.h>

TSLanguage *tree_sitter_json();

struct buffer {
  char *buf;
  long len;
};

const char *read_file(void *payload, uint32_t byte_index,
                TSPoint position, uint32_t *bytes_read) {
  long len = ((struct buffer *) payload)->len;
  if (byte_index >= len) {
    *bytes_read = 0;
    return (char *) "";
  } else {
    *bytes_read = len - byte_index;
    return (char *) (((struct buffer *) payload)->buf) + byte_index;
  }
}

int main() {
  TSParser *parser = ts_parser_new();
  ts_parser_set_language(parser, tree_sitter_json());

  /* Copy the file into BUFFER. */
  FILE *file = fopen("benchmark.3.json", "rb");
  fseek(file, 0, SEEK_END);
  long length = ftell (file);
  fseek(file, 0, SEEK_SET);
  char *buffer = malloc (length);
  fread(buffer, 1, length, file);
  fclose (file);

  struct buffer buf = {buffer, length};
  TSInput input = {&buf, read_file, TSInputEncodingUTF8};
  
  TSTree *tree = ts_parser_parse(parser, NULL, input);
  TSTree *new_tree = ts_parser_parse(parser, tree, input);
  
  free(buffer);
  ts_tree_delete(tree);
  ts_tree_delete(new_tree);
  ts_parser_delete(parser);

  return 0;
}

^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-15 16:48                     ` Eli Zaretskii
  2021-07-15 18:23                       ` Yuan Fu
@ 2021-07-20 16:25                       ` Stephen Leake
  2021-07-20 16:45                         ` Eli Zaretskii
  1 sibling, 1 reply; 284+ messages in thread
From: Stephen Leake @ 2021-07-20 16:25 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Yuan Fu, monnier, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Yuan Fu <casouri@gmail.com>
>> Date: Thu, 15 Jul 2021 12:19:31 -0400
>> Cc: monnier@iro.umontreal.ca,
>>  emacs-devel@gnu.org
>> 
>> > Why do you need to do this when a buffer is updated? why not use
>> > display as the trigger?  Large portions of a buffer will never be
>> > displayed, and some buffers will not be displayed at all.  Why waste
>> > cycles on them?  Redisplay is perfectly equipped to tell you when some
>> > chunk of buffer text is going to be redrawn, and it already knows to
>> > do nothing if the buffer haven't changed.
>> 
>> Tree-sitter expects you to tell it every single change to the parsed text.
>
> That cannot be true, because the parsed text could be in a state where
> parsing it will fail.  

You can relax this to "when a parse is requested, tree-sitter must be
given the net changes to the text". You can combine several changes into
one, if that saves time or something.

But tree-sitter does have to deal with incorrect syntax.

> When you are in the middle of writing the code, this is what will
> happen many times, even if you pass the whole buffer to the parser.

Yes.

> And since tree-sitter _must_ be able to deal with this problem, it
> also must be able to receive incomplete parts of the buffer text, and
> do the best it can with it.

That does not follow.

I took that approach with ada-mode, and the results are not good. Mostly
this is because Ada requires always parsing from BOB, so parsing only
part of the buffer is bound to give bad results.

Knowing the changes from a previous complete parse allows the parser to
do a much better job.

>> Say you have a buffer with some content and scrolled through it, so
>> tree-sitter has parsed the whole buffer. Then some elisp edited some
>> text outside the visible portion. Redisplay doesn’t happen, we don’t
>> tell this edit to tree-sitter. Then I scroll to the place that has
>> been edited. What now?
>
> Now you call tree-sitter passing it the part of the buffer that needs
> to be parsed (e.g., the chunk that is about to be displayed).  If
> tree-sitter needs to look back, it will.

No, you pass tree-sitter the net list of changes since the last parse
was requested. Changes outside the visible region can easily affect the
visible region; consider inserting a comment or block start or end.

>> I’ve lost the change information, and tree-sitter’s tree is out-dated.
>
> No information is lost because the updated buffer text is available.

That is useful only if the previous buffer text is also available, so
you can diff it. It is more efficient to keep a list of changes.
Although if that list grows too large, it can be better to simply start
over, and parse the whole buffer again.

> In addition, Emacs records (for redisplay purposes) two places in each
> buffer related to changes: the minimum buffer position before which no
> changes were done since last redisplay, and the maximum buffer
> position beyond which there were no changes.  This can also be used to
> pass only a small part of the buffer to the parser, because the rest
> didn't change.

Again, the input to tree-sitter is a list of changes, not a block of
text containing changes.

That is because of the way incremental parsing works.

The list of changes to the buffer text are used to edit the parse tree,
deleting nodes that represent deleted or modified text, lexing the new
text to create new nodes.

Then the parser is run on the edited tree, _not_ on the buffer text. The
parser adds new nodes as appropriate to arrive at a complete parse tree.

There's no point in trying to tell the parser how much to parse; any
non-edited portion of the original text will be represented in the
edited tree by one or a small number of nodes; the parser then consumes
those quickly.

>> What we can do is to only parse the portion from BOB to the visible
>> portion. So we won’t parse the whole buffer unless you scroll to the
>> bottom.

You can stop parsing at the end of a complete grammar production; in
languages that require parsing from BOB, that is always EOB. The parser
cannot stop at an arbitrary point in the text; that would leave an
incomplete tree.

The point of incremental parsing is that parsing unchanged text is very
fast, because it is represented by a small number of nodes in the edited
tree.

> My primary worry is the fact that you want to use buffer-change hooks
> (and will soon enough want to use post-command-hook as well).  They
> slow down editing, sometimes tremendously, so I'd very much prefer not
> to use those hooks for fontification/parsing.  The original font-lock
> mechanism in Emacs 19 used these hooks; we switched to jit-lock and
> its redisplay-triggered fontifications because the original design had
> problems which couldn't be solved reliably and with reasonable
> performance.  I hope we will not make the mistake of going back to
> that sub-optimal design.

Ah. That could be a problem; incremental parsing fundamentally requires
a list of changes.

If the parser is in an Emacs module, so it has direct access to the
buffer, then the hooks only need to record the buffer positions of the
insertions and deletions, not the new text. That should be very fast.
Then the parse is only requested when the results are needed for
something, like indent or fontify.

That is how wisi works, except the parser is currently in an external
process, so the buffer change hooks also have to store the new text,
which can be large. Which is a good reason to improve wisi to support
the parser in a module.

In addition, the code that computes the requested information
(fontification or indentation) takes region bounds as input, and only
computes the information for that region (using the full parse tree);
that is much faster than always computing all information for the entire
buffer.

eglot, on the other hand, sends the change information to the LSP server
immediately (or after small delay), and then tries to do something with
the response, rather than waiting until some event triggers a need for
information from the server.

I'm guessing that font-lock ran the actual fontification functions from
the buffer-change hooks; that would be slow.

-- 
-- Stephe



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-15 18:23                       ` Yuan Fu
  2021-07-16  7:30                         ` Eli Zaretskii
@ 2021-07-20 16:27                         ` Stephen Leake
  1 sibling, 0 replies; 284+ messages in thread
From: Stephen Leake @ 2021-07-20 16:27 UTC (permalink / raw)
  To: Yuan Fu; +Cc: Eli Zaretskii, Stefan Monnier, emacs-devel

Yuan Fu <casouri@gmail.com> writes:

> ... I was hoping to use the parse tree for more than
> fontification, e.g., motion commands like sexp-forward/backward or
> structural editing commands like expand-region. 

wisi currently supports this.

-- 
-- Stephe



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-16  7:30                         ` Eli Zaretskii
  2021-07-16 14:27                           ` Yuan Fu
@ 2021-07-20 16:28                           ` Stephen Leake
  1 sibling, 0 replies; 284+ messages in thread
From: Stephen Leake @ 2021-07-20 16:28 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Yuan Fu, monnier, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

> Each command/feature that needs an updated TS tree will take care of
> updating TS with the relevant information.  We should record whatever
> we need for that as side effect of primitives that change buffer text
> (in insdel.c), and use the recorded info to update TS.  But the actual
> passing of text to TS should happen lazily, when we actually need its
> re-parsing, not when the changes to buffer text are done.

Yes.

-- 
-- Stephe



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-17 17:30                                     ` Stefan Monnier
  2021-07-17 17:54                                       ` Eli Zaretskii
  2021-07-19 15:16                                       ` Yuan Fu
@ 2021-07-20 16:32                                       ` Stephen Leake
  2021-07-20 16:48                                         ` Eli Zaretskii
                                                           ` (2 more replies)
  2 siblings, 3 replies; 284+ messages in thread
From: Stephen Leake @ 2021-07-20 16:32 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Clément Pit-Claudel, emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

> In your benchmark , you give numbers for:
> - initial full-text parse (a bit above 1MB/s)
> - cost of update-without-reparse
>
> but I think it would be nice to see the cost of the reparse after
> those updates (should be much faster than the initial parse).
>
> Clément said:
>> I have no idea if it makes sense, but: does the initial parse need to be
>> synchronous, or could you instead run the parsing in one thread, and the
>> rest of Emacs in another? (I'm talking about concurrent execution, not
>> cooperative threading).
>
> If we copy the buffer's content to a freshly malloc area before passing
> that to TS, then there should be no problem running TS in a separate
> concurrent thread, indeed.

Except that the results will not be useful, since they won't apply to
the original buffer if it is changed. And if the original buffer is not
changed, then we do not need to run the parser asynchronously.

Computing fontification and indentation must be synchronous.

-- 
-- Stephe



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-20 16:25                       ` Stephen Leake
@ 2021-07-20 16:45                         ` Eli Zaretskii
  2021-07-21 15:49                           ` Stephen Leake
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-20 16:45 UTC (permalink / raw)
  To: Stephen Leake; +Cc: casouri, monnier, emacs-devel

> From: Stephen Leake <stephen_leake@stephe-leake.org>
> Cc: Yuan Fu <casouri@gmail.com>,  monnier@iro.umontreal.ca,
>   emacs-devel@gnu.org
> Date: Tue, 20 Jul 2021 09:25:11 -0700
> 
> > In addition, Emacs records (for redisplay purposes) two places in each
> > buffer related to changes: the minimum buffer position before which no
> > changes were done since last redisplay, and the maximum buffer
> > position beyond which there were no changes.  This can also be used to
> > pass only a small part of the buffer to the parser, because the rest
> > didn't change.
> 
> Again, the input to tree-sitter is a list of changes, not a block of
> text containing changes.

I fail to see the significance of the difference.  Surely, you could
hand it a block of text with changes to mean that this block replaces
the previous version of that block.  It might take the parser more
work to update the parse tree in this case, but if it's fast enough,
that won't be the problem.  Right?

> If the parser is in an Emacs module, so it has direct access to the
> buffer, then the hooks only need to record the buffer positions of the
> insertions and deletions, not the new text. That should be very fast.

(You are talking about the undo-list.)

But even this is wasteful: it is quite customary to delete, then
re-insert, then re-delete again, etc. several times.  So collecting
these operations will produce much more "changes" than strictly
needed.  That's why I'm trying to find a simpler, less wasteful
strategies.  Since TS is very fast, we can trade some of the speed for
simpler, more scalable design of tracking changes.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-20 16:32                                       ` Stephen Leake
@ 2021-07-20 16:48                                         ` Eli Zaretskii
  2021-07-20 17:38                                           ` Stefan Monnier
  2021-07-20 17:36                                         ` Stefan Monnier
  2021-07-20 18:04                                         ` Clément Pit-Claudel
  2 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-20 16:48 UTC (permalink / raw)
  To: Stephen Leake; +Cc: cpitclaudel, monnier, emacs-devel

> From: Stephen Leake <stephen_leake@stephe-leake.org>
> Date: Tue, 20 Jul 2021 09:32:23 -0700
> Cc: Clément Pit-Claudel <cpitclaudel@gmail.com>,
>  emacs-devel@gnu.org
> 
> Computing fontification and indentation must be synchronous.

I wouldn't say "must", but going async on them certainly brings in a
lot more complexity, and we should avoid that unless it's REALLY
needed.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-20 16:32                                       ` Stephen Leake
  2021-07-20 16:48                                         ` Eli Zaretskii
@ 2021-07-20 17:36                                         ` Stefan Monnier
  2021-07-20 18:05                                           ` Clément Pit-Claudel
  2021-07-21 16:02                                           ` Stephen Leake
  2021-07-20 18:04                                         ` Clément Pit-Claudel
  2 siblings, 2 replies; 284+ messages in thread
From: Stefan Monnier @ 2021-07-20 17:36 UTC (permalink / raw)
  To: Stephen Leake; +Cc: Clément Pit-Claudel, emacs-devel

>> If we copy the buffer's content to a freshly malloc area before passing
>> that to TS, then there should be no problem running TS in a separate
>> concurrent thread, indeed.
> Except that the results will not be useful, since they won't apply to
> the original buffer if it is changed.

Not true: we just have to keep track of the list of changes (as Yuan's
patch does), then pass it to tree-sitter to get a tree up-to-date w.r.t
the current content of the buffer.

> And if the original buffer is not changed, then we do not need to run
> the parser asynchronously.

We do:
- because we want to do other things in the mean time
- because we want to take advantage of the many CPU cores sitting idle.


        Stefan




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-20 16:48                                         ` Eli Zaretskii
@ 2021-07-20 17:38                                           ` Stefan Monnier
  0 siblings, 0 replies; 284+ messages in thread
From: Stefan Monnier @ 2021-07-20 17:38 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Stephen Leake, cpitclaudel, emacs-devel

>> Computing fontification and indentation must be synchronous.
> I wouldn't say "must", but going async on them certainly brings in a
> lot more complexity, and we should avoid that unless it's REALLY
> needed.

Agreed.  Tree-sitter's *re*parse is supposed to be fast enough
for that.  My suggestion to do it concurrently was mostly aimed at the
initial parse (which does imply that the initial fontification would be
async for those modes which depend on tree-sitter for fontification).


        Stefan




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-20 16:32                                       ` Stephen Leake
  2021-07-20 16:48                                         ` Eli Zaretskii
  2021-07-20 17:36                                         ` Stefan Monnier
@ 2021-07-20 18:04                                         ` Clément Pit-Claudel
  2021-07-20 18:24                                           ` Eli Zaretskii
  2021-07-21 16:54                                           ` [SPAM UNSURE] " Stephen Leake
  2 siblings, 2 replies; 284+ messages in thread
From: Clément Pit-Claudel @ 2021-07-20 18:04 UTC (permalink / raw)
  To: Stephen Leake, Stefan Monnier; +Cc: emacs-devel

On 7/20/21 12:32 PM, Stephen Leake wrote:
> Computing fontification and indentation must be synchronous.

Must?  What makes you say that?

> Except that the results will not be useful, since they won't apply to
the original buffer if it is changed.

Then you will send the additional changes and wait.
TS is an incremental parser, so the work it will have done incorporating part of the changes will not be wasted.

Concrete example: if you have a bit of elisp that runs for .5s to make modifications to the buffer, then press "indent", and only then do you send changes to TS and wait for the response synchronously, then you will wait for .5s + time to incorporate all changes.  If you start processing the changes in parallel as they are made by the Elisp code, then you will only wait for .5s + time to incorporate only the changes that had not been processed yet.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-20 17:36                                         ` Stefan Monnier
@ 2021-07-20 18:05                                           ` Clément Pit-Claudel
  2021-07-21 16:02                                           ` Stephen Leake
  1 sibling, 0 replies; 284+ messages in thread
From: Clément Pit-Claudel @ 2021-07-20 18:05 UTC (permalink / raw)
  To: Stefan Monnier, Stephen Leake; +Cc: emacs-devel

On 7/20/21 1:36 PM, Stefan Monnier wrote:
>>> If we copy the buffer's content to a freshly malloc area before passing
>>> that to TS, then there should be no problem running TS in a separate
>>> concurrent thread, indeed.
>> Except that the results will not be useful, since they won't apply to
>> the original buffer if it is changed.
> 
> Not true: we just have to keep track of the list of changes (as Yuan's
> patch does), then pass it to tree-sitter to get a tree up-to-date w.r.t
> the current content of the buffer.
> 
>> And if the original buffer is not changed, then we do not need to run
>> the parser asynchronously.
> 
> We do:
> - because we want to do other things in the mean time
> - because we want to take advantage of the many CPU cores sitting idle.

Ah, sorry, I didn't see your message, so I sent an answer that's approximately equivalent.
But note that I'm not even sure we need to copy the buffer.  Of course, I agree that it makes a lot of things a lot simpler.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-20 18:04                                         ` Clément Pit-Claudel
@ 2021-07-20 18:24                                           ` Eli Zaretskii
  2021-07-21 16:54                                           ` [SPAM UNSURE] " Stephen Leake
  1 sibling, 0 replies; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-20 18:24 UTC (permalink / raw)
  To: Clément Pit-Claudel; +Cc: stephen_leake, monnier, emacs-devel

> From: Clément Pit-Claudel <cpitclaudel@gmail.com>
> Date: Tue, 20 Jul 2021 14:04:25 -0400
> Cc: emacs-devel@gnu.org
> 
> Concrete example: if you have a bit of elisp that runs for .5s to make modifications to the buffer, then press "indent", and only then do you send changes to TS and wait for the response synchronously, then you will wait for .5s + time to incorporate all changes.  If you start processing the changes in parallel as they are made by the Elisp code, then you will only wait for .5s + time to incorporate only the changes that had not been processed yet.

Your example is too abstract and disregards the issues that Emacs has
with such "pure" parallelism.  In my response to your original
proposal I tried to explain the difficulties with implementing your
suggestions _in_Emacs_, and the complexity which any such
implementation will bring with it.  When you compare synchronous with
async implementation, you need to take those difficulties and
complexities into consideration, otherwise the comparison will not be
useful.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-17  7:16                                     ` Eli Zaretskii
@ 2021-07-20 20:36                                       ` Clément Pit-Claudel
  2021-07-21 11:26                                         ` Eli Zaretskii
  2021-07-21 16:29                                         ` Stephen Leake
  0 siblings, 2 replies; 284+ messages in thread
From: Clément Pit-Claudel @ 2021-07-20 20:36 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

Thanks for the detailed reply.

On 7/17/21 3:16 AM, Eli Zaretskii wrote:
> When Emacs moves or enlarges/shrinks the gap, that affects the entire
> buffer text after the gap, regardless of where the gap is.  So it will
> affect the TS reader if it reads stuff after the gap.

Doesn't enlarging the gap require allocating a new buffer and copying data to it?  If so it wouldn't affect the TS reader.  Moving is indeed trickier, that's what I referred to as "limited contention".

>> You do need to be careful to not read the garbage data from the gap, but otherwise seeing stale or even inconsistent data from the parser thread shouldn't be an issue, since tree-sitter is supposed to be robust to bad parses.
> 
> What would be the purpose of calling the parser if we know in advance
> it will fail when it gets to the "garbage" caused by async access to
> the buffer text?

It won't fail, will it?  I thought this was the point of TS, that it would reuse the initial parse on the "good" parts in subsequent parses.

> So I don't see how this could be done without some inter-locking.

Yes, there probably need to be some care around the gap area.  But that's what I was referring to re. "optimistic concurrency".

> And what do you want the code which requested parsing do while the
> parse thread runs?  The requesting code is in the main thread, so if
> it just waits, you don't gain anything.

You'd have the parser running continuously in the background, every time there is a change.  When a piece of code requests a parse it blocks and waits, but presumably for not too long because a very recent previous parse means that the blocking parse is fast.

>> In fact, depending on how robust tree-sitter is, you might even be able to do the concurrency-control optimistically (parse everything up to close to the gap, check that the gap hasn't moved into the region that you read, and then resume reading or rollback).
> 
> I don't understand what you suggest here.  For starters, the gap could
> move (assuming you are still talking about a separate thread that does
> the parsing), and what do we do then?

Nothing, we start the next parse when this one completes.

> I don't understand what could recording the gap solve.  The stuff in
> the gap is generally garbage, and can easily include invalid multibyte
> sequences.  I don't think it's a good idea to pass that to TS.  Also,
> recording the gap changes in the main thread and accessing that
> information from a concurrent thread again opens a window for races,
> and requires synchronization.

This list of gap changes wouldn't be accessed concurrently: you would (message-)pass a copy of it to the parser thread every time it starts a new parse.

> Bottom line, I think what you are suggesting is premature
> optimization: we don't yet know that we will need this. 

I thought we knew that a full parse of some files could take over a second; but yes, it will be nice if we can find a synchronous way to avoid having to do a full parse.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-20 20:36                                       ` Clément Pit-Claudel
@ 2021-07-21 11:26                                         ` Eli Zaretskii
  2021-07-21 13:38                                           ` Clément Pit-Claudel
  2021-07-21 16:29                                         ` Stephen Leake
  1 sibling, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-21 11:26 UTC (permalink / raw)
  To: Clément Pit-Claudel; +Cc: emacs-devel

> Cc: emacs-devel@gnu.org
> From: Clément Pit-Claudel <cpitclaudel@gmail.com>
> Date: Tue, 20 Jul 2021 16:36:42 -0400
> 
> Thanks for the detailed reply.
> 
> On 7/17/21 3:16 AM, Eli Zaretskii wrote:
> > When Emacs moves or enlarges/shrinks the gap, that affects the entire
> > buffer text after the gap, regardless of where the gap is.  So it will
> > affect the TS reader if it reads stuff after the gap.
> 
> Doesn't enlarging the gap require allocating a new buffer and copying data to it?

Not necessarily.  First, gap could be enlarged for reasons other than
growing buffer text as a whole.  And even if we must grow buffer text,
a good memory-allocation system will many times resize the existing
memory block before it allocates another..

> If so it wouldn't affect the TS reader.

Not true, in general.  When a new block is allocated by the OS/libc,
the old one is generally invalid and cannot be accessed.  In many
cases, the old block could be unmapped from the program's address
space, in which case accessing it will segfault.

> Moving is indeed trickier, that's what I referred to as "limited contention".

We move the gap quite a lot.

> >> You do need to be careful to not read the garbage data from the gap, but otherwise seeing stale or even inconsistent data from the parser thread shouldn't be an issue, since tree-sitter is supposed to be robust to bad parses.
> > 
> > What would be the purpose of calling the parser if we know in advance
> > it will fail when it gets to the "garbage" caused by async access to
> > the buffer text?
> 
> It won't fail, will it?

"Fail" in the sense that it will be able to process only a small
portion of buffer text before it gets to garbage.

> > And what do you want the code which requested parsing do while the
> > parse thread runs?  The requesting code is in the main thread, so if
> > it just waits, you don't gain anything.
> 
> You'd have the parser running continuously in the background, every time there is a change.  When a piece of code requests a parse it blocks and waits, but presumably for not too long because a very recent previous parse means that the blocking parse is fast.

Well, you cannot safely/usefully parse the buffer "continuously in the
background", for the reasons explained above, because Lisp programs
change buffer text quite a lot.

> > I don't understand what could recording the gap solve.  The stuff in
> > the gap is generally garbage, and can easily include invalid multibyte
> > sequences.  I don't think it's a good idea to pass that to TS.  Also,
> > recording the gap changes in the main thread and accessing that
> > information from a concurrent thread again opens a window for races,
> > and requires synchronization.
> 
> This list of gap changes wouldn't be accessed concurrently: you would (message-)pass a copy of it to the parser thread every time it starts a new parse.

I still don't see the point.  Can you describe in more detail what
would you suggest doing with the list of gap changes?  Just take a
specific example of a small set of gap changes and tell how to use
that.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-21 11:26                                         ` Eli Zaretskii
@ 2021-07-21 13:38                                           ` Clément Pit-Claudel
  2021-07-21 13:51                                             ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Clément Pit-Claudel @ 2021-07-21 13:38 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

On 7/21/21 7:26 AM, Eli Zaretskii wrote:
> I still don't see the point.  Can you describe in more detail what
> would you suggest doing with the list of gap changes?  Just take a
> specific example of a small set of gap changes and tell how to use
> that.

I can try, but the idea was half-baked from the start, so I'm not sure how much value it will bring.  All I was saying is that depending on how robust TS is, feeding it:

   <valuable text><small bit of the gap because the gap moved while TS was scanning><more valuable data>

and then, knowing that the gap had moved, re-feeding it just the area that corresponds to the places around the boundaries of the gap might yield a speedup.

So if the buffer is XYYGGGZ, where G is the gap, and becomes XGGIYYZ while we're scanning because of cursor motion + an insertion, then TS might see XYGIYYZ, due to concurrent mutations; but if we recorded that the gap moved and insertions happened at -#####---, then we can re-feed GGIYY to TS (omitting the Gs, of course), and hopefully it can reuse the parse of X and Z.  If X and Z are long enough, that can be valuable.

Alternatively, keeping the list of changes allows us to maintain a copy of the buffer that TS uses for scanning, with updates delayed until TS is done scanning.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-21 13:38                                           ` Clément Pit-Claudel
@ 2021-07-21 13:51                                             ` Eli Zaretskii
  2021-07-22  4:59                                               ` Clément Pit-Claudel
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-21 13:51 UTC (permalink / raw)
  To: Clément Pit-Claudel; +Cc: emacs-devel

> From: Clément Pit-Claudel <cpitclaudel@gmail.com>
> Date: Wed, 21 Jul 2021 09:38:31 -0400
> Cc: emacs-devel@gnu.org
> 
>    <valuable text><small bit of the gap because the gap moved while TS was scanning><more valuable data>
> 
> and then, knowing that the gap had moved, re-feeding it just the area that corresponds to the places around the boundaries of the gap might yield a speedup.

You are assuming that TS will be able to process both <valuable text>
and <more valuable data>, even though it eats the garbage in the gap?
That isn't guaranteed, due to possibly invalid byte sequences in the
gap.

Without synchronization, you also risk reading invalid byte sequences
even outside the gap, because while you read part of a byte sequence,
some editing operation modifies the buffer at that very place.

> Alternatively, keeping the list of changes allows us to maintain a copy of the buffer that TS uses for scanning, with updates delayed until TS is done scanning.

Having a copy for each buffer that needs parsing doesn't scale.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-20 16:45                         ` Eli Zaretskii
@ 2021-07-21 15:49                           ` Stephen Leake
  2021-07-21 19:37                             ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Stephen Leake @ 2021-07-21 15:49 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: casouri, monnier, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Stephen Leake <stephen_leake@stephe-leake.org>
>> Cc: Yuan Fu <casouri@gmail.com>,  monnier@iro.umontreal.ca,
>>   emacs-devel@gnu.org
>> Date: Tue, 20 Jul 2021 09:25:11 -0700
>> 
>> > In addition, Emacs records (for redisplay purposes) two places in each
>> > buffer related to changes: the minimum buffer position before which no
>> > changes were done since last redisplay, and the maximum buffer
>> > position beyond which there were no changes.  This can also be used to
>> > pass only a small part of the buffer to the parser, because the rest
>> > didn't change.
>> 
>> Again, the input to tree-sitter is a list of changes, not a block of
>> text containing changes.
>
> I fail to see the significance of the difference.  Surely, you could
> hand it a block of text with changes to mean that this block replaces
> the previous version of that block.  It might take the parser more
> work to update the parse tree in this case, but if it's fast enough,
> that won't be the problem.  Right?

tree-sitter doesn't store the previous text, so there's nothing to
compare it to. Alternately, this would require the parser to store the
previous text so it can compute the diff; that could be added in a
wrapper around tree-sitter.

wisi does store the previous text, so it could compute the diff. But
because of memory pressure, we want a design that does not require a
copy of the buffer text; when wisi is turned into an Emacs module, it
will not store a copy of the text.

>> If the parser is in an Emacs module, so it has direct access to the
>> buffer, then the hooks only need to record the buffer positions of the
>> insertions and deletions, not the new text. That should be very fast.
>
> (You are talking about the undo-list.)

Almost; the undo-list can get reset before the parser needs it. And
sometimes it is disabled. But it might make sense to try to use that
instead of maintaining a separate list of changes.

It might make sense to delete the matching change from the parser change
list when undo is invoked, rather than adding another change.

> But even this is wasteful: it is quite customary to delete, then
> re-insert, then re-delete again, etc. several times.  So collecting
> these operations will produce much more "changes" than strictly
> needed.  

Yes. The wisi parser Ada code includes a step that combines all the
changes (in arbitrary buffer-pos order) into a minimal list of changes
in buffer-pos order; that simplifies applying multiple changes to the
parse tree. We could move that to elisp, if that would help (it's in Ada
because I much prefer debugging Ada to debugging elisp). That could be
done in the buffer-change hook; if the current change can be combined
with the previous one, do that instead of adding a new one.

> That's why I'm trying to find a simpler, less wasteful strategies.
> Since TS is very fast, we can trade some of the speed for simpler,
> more scalable design of tracking changes.

I don't see how optimizing the change list makes it more "scalable"; the
worst case is that the optimal list is the complete list of actions the
user takes, and that will happen often enough to be an important case.

In practice font-lock is triggered on every character typed by the user
(Emacs is faster than people can type), so there will typically be only
one change; nothing to optimize.

In the case where some elisp is changing the buffer in several places
(ie indent-region, or some other re-format), optimizing the change list
might make sense, if the elisp code is not already optimized for that.

-- 
-- Stephe



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-20 17:36                                         ` Stefan Monnier
  2021-07-20 18:05                                           ` Clément Pit-Claudel
@ 2021-07-21 16:02                                           ` Stephen Leake
  2021-07-21 17:16                                             ` Stefan Monnier
  1 sibling, 1 reply; 284+ messages in thread
From: Stephen Leake @ 2021-07-21 16:02 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Clément Pit-Claudel, emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>>> If we copy the buffer's content to a freshly malloc area before passing
>>> that to TS, then there should be no problem running TS in a separate
>>> concurrent thread, indeed.
>> Except that the results will not be useful, since they won't apply to
>> the original buffer if it is changed.
>
> Not true: we just have to keep track of the list of changes (as Yuan's
> patch does), then pass it to tree-sitter to get a tree up-to-date w.r.t
> the current content of the buffer.

The point is that more changes can happen while the parser is running.

>> And if the original buffer is not changed, then we do not need to run
>> the parser asynchronously.
>
> We do:
> - because we want to do other things in the mean time
> - because we want to take advantage of the many CPU cores sitting
> idle.

"using more cores" means parallel execution, which can still be
synchronous. That could be done for all invocations of the parser. 

In the typical case of opening a new buffer, it might make sense
to spawn a thread to compute the fontification while the rest of the
major-mode-hook runs. Except functions on that hook could affect the
fontification, and not by changing the buffer; they could set the
fontification level or style.

Are there "other things" that are guaranteed to not affect
fontification?

In any case, the buffer must be read-only while the fontification is
being computed, so either the main emacs thread must wait for the
fontification to complete, or it must actually mark the buffer read-only
until the fontification completes, which could surprise the user.

On the other hand, if we don't force read-only, it might be possible to
use only part of the fontification information, up to the first change.

-- 
-- Stephe



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-20 20:36                                       ` Clément Pit-Claudel
  2021-07-21 11:26                                         ` Eli Zaretskii
@ 2021-07-21 16:29                                         ` Stephen Leake
  2021-07-21 16:54                                           ` Clément Pit-Claudel
  1 sibling, 1 reply; 284+ messages in thread
From: Stephen Leake @ 2021-07-21 16:29 UTC (permalink / raw)
  To: Clément Pit-Claudel; +Cc: Eli Zaretskii, emacs-devel

Clément Pit-Claudel <cpitclaudel@gmail.com> writes:

> On 7/17/21 3:16 AM, Eli Zaretskii wrote:
>>> You do need to be careful to not read the garbage data from the
>>> gap, but otherwise seeing stale or even inconsistent data from the
>>> parser thread shouldn't be an issue, since tree-sitter is supposed
>>> to be robust to bad parses.
>> 
>> What would be the purpose of calling the parser if we know in advance
>> it will fail when it gets to the "garbage" caused by async access to
>> the buffer text?
>
> It won't fail, will it? I thought this was the point of TS, that it
> would reuse the initial parse on the "good" parts in subsequent
> parses.

There are limits to the error recovery, and throwing garbage text at
it is likely to encounter those limits. wisi is even more robust, but I
still get "error recover fail" daily.

>> So I don't see how this could be done without some inter-locking.
>
> Yes, there probably need to be some care around the gap area. But
> that's what I was referring to re. "optimistic concurrency".
>
>> And what do you want the code which requested parsing do while the
>> parse thread runs?  The requesting code is in the main thread, so if
>> it just waits, you don't gain anything.
>
> You'd have the parser running continuously in the background, every
> time there is a change. 

> When a piece of code requests a parse it blocks and waits, but
> presumably for not too long because a very recent previous parse means
> that the blocking parse is fast.

If the parser is truly fast enough to keep up with typing, this does
make sense. Good error correction is slower than non-so-good error
correction, so there might be a trade-off here.

On the other hand, in the typical case of the user typing characters,
font-lock is triggered on every character, so the parser is effectively
synchronous, and the inter-thread communication is wasted time.

We need some metrics on a real implementation to decide this part of the
design.

>>> In fact, depending on how robust tree-sitter is, you might even be
>>> able to do the concurrency-control optimistically (parse everything
>>> up to close to the gap, check that the gap hasn't moved into the
>>> region that you read, and then resume reading or rollback).
>> 
>> I don't understand what you suggest here.  For starters, the gap could
>> move (assuming you are still talking about a separate thread that does
>> the parsing), and what do we do then?
>
> Nothing, we start the next parse when this one completes.

By "nothing", I think you mean "abort the parse".

>> Bottom line, I think what you are suggesting is premature
>> optimization: we don't yet know that we will need this. 
>
> I thought we knew that a full parse of some files could take over a
> second; 

Yes, for both tree-sitter and wisi. wisi can take even longer if lots of
error correction is required (I have a time-out set at 5 seconds). But
that happens when the file is first opened; I doubt any user would start
typing that fast. I know I typically take a while to just look at the
text, and then navigate to the point of interest.

> but yes, it will be nice if we can find a synchronous way to avoid
> having to do a full parse.

Hmm. "looking at the text" is better done after it is fontified, so
doing a faster but possibly worse parse and fontification on just the
initial visible region might be a good idea. 

While the partial parse is running, we could also spawn a parser thread
to run the full parse.

And if the user scrolls before the full parse is done, do a second
partial parse on the new visible region.

I'll put that on my list of things to try in wisi.

-- 
-- Stephe



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-21 16:29                                         ` Stephen Leake
@ 2021-07-21 16:54                                           ` Clément Pit-Claudel
  2021-07-21 19:43                                             ` Eli Zaretskii
  2021-07-21 21:54                                             ` Stephen Leake
  0 siblings, 2 replies; 284+ messages in thread
From: Clément Pit-Claudel @ 2021-07-21 16:54 UTC (permalink / raw)
  To: emacs-devel

On 7/21/21 12:29 PM, Stephen Leake wrote:
> Yes, for both tree-sitter and wisi. wisi can take even longer if lots of
> error correction is required (I have a time-out set at 5 seconds). But
> that happens when the file is first opened; I doubt any user would start
> typing that fast. I know I typically take a while to just look at the
> text, and then navigate to the point of interest.

I'm not sure.  We've had significant complaint in Flycheck for freezing Emacs for <1s: we have a synchronous sanity check to determine whether a checker can execute in a buffer (it runs a single time, and it should be async but I haven't gotten around to rewriting it).  The problem is that some programs, including eslint, can take as much 1s, and in some bad cases 2-3 seconds, to parse their own config and decide if they can even run.

Users have complained about this delay.  It might be better if they were able to scroll around, though — is that what happens with WISI?  But if we have a fully synchronous TS, then that won't be possible either: it will be a complete Emacs freeze, no?



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: [SPAM UNSURE] Re: How to add pseudo vector types
  2021-07-20 18:04                                         ` Clément Pit-Claudel
  2021-07-20 18:24                                           ` Eli Zaretskii
@ 2021-07-21 16:54                                           ` Stephen Leake
  2021-07-21 17:12                                             ` Clément Pit-Claudel
  1 sibling, 1 reply; 284+ messages in thread
From: Stephen Leake @ 2021-07-21 16:54 UTC (permalink / raw)
  To: Clément Pit-Claudel; +Cc: Stefan Monnier, emacs-devel

Clément Pit-Claudel <cpitclaudel@gmail.com> writes:

> On 7/20/21 12:32 PM, Stephen Leake wrote:
>> Computing fontification and indentation must be synchronous.
>
> Must?  What makes you say that?

Otherwise the results cannot be applied to the buffer, in general.

>> Except that the results will not be useful, since they won't apply to
>> the original buffer if it is changed.
>
> Then you will send the additional changes and wait.

It is that "wait" that makes it synchronous.

Note that synchronous is not the same as single-thread; mulitple threads
can be used, as long as the main thread waits for the parse results.

But synchronous is also not the same as requiring the buffer text to be
read-only while the parser is running, which is an additional
requirement if the parser is reading the buffer text directly.

> TS is an incremental parser, so the work it will have done
> incorporating part of the changes will not be wasted.

Not guarranteed if the new changes are before some of the old ones, and
TS has no support for interrupting a parse to add more changes.

> Concrete example: if you have a bit of elisp that runs for .5s to make
> modifications to the buffer, then press "indent", and only then do you
> send changes to TS and wait for the response synchronously, then you
> will wait for .5s + time to incorporate all changes. If you start
> processing the changes in parallel as they are made by the Elisp code,
> then you will only wait for .5s + time to incorporate only the changes
> that had not been processed yet.

It might be possible to implement the incremental parse algorithm so it can
accept changes after the parse starts. One requirement would be that the
new changes must be after the current parse point, which is a race
condition.

In your example, "indent" will go back to the first edit point to compute
the indent there; that is pretty much guarranteed to be before the
current parse point, which will be on one of the later changes.

Neither TS nor wisi support that; both have a separate Edit_Tree step
that applies all the changes to the parse tree before Parse is called. It
might be possible to integrate Edit_Tree into Parse, so that changes are
only applied when they are actually needed. But Edit_Tree and Parse are
already very complicated; keeping them separate is a good thing for
correctness and debugging.

Hmm. Perhaps you are not talking about interrupting the parse; you are
assuming that the parse for each change completes before the next change
arrives. Depending on the details of the changes, that might or might
not be wasted time; if we are on battery power (or worried about carbon
footprint), this might be a bad idea.

It still means fontification has to wait for the parse to complete
on all of the changes; it's synchronous in the sense that no user
actions on the buffer are allowed between the time fontification is
requested and the time text properties are applied.

-- 
-- Stephe



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: [SPAM UNSURE] Re: How to add pseudo vector types
  2021-07-21 16:54                                           ` [SPAM UNSURE] " Stephen Leake
@ 2021-07-21 17:12                                             ` Clément Pit-Claudel
  2021-07-21 19:49                                               ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Clément Pit-Claudel @ 2021-07-21 17:12 UTC (permalink / raw)
  To: emacs-devel

On 7/21/21 12:54 PM, Stephen Leake wrote:
> Hmm. Perhaps you are not talking about interrupting the parse; you are
> assuming that the parse for each change completes before the next change
> arrives.

Neither of these.  I'm assuming that you open a file, launch a parse, batch up changes until that first parse completes, then launch a second parse, during which additional changes are batched up, then launch a third parse, etc.

Any time you actually need the info (for navigating, or for fontification, or…) then you either use the last parse if it was recent enough, or (more likely) you block until you can complete a synchronous parse.

This helps if you run a slow, blocking operation that edits the buffer.  Not so much otherwise, indeed.

> It still means fontification has to wait for the parse to complete
> on all of the changes; it's synchronous in the sense that no user
> actions on the buffer are allowed between the time fontification is
> requested and the time text properties are applied.

Sure, sure; but hopefully that time is shorter than if the parser hadn't received a headstart.

Also, note that my original suggestion was mostly about the initial parse:

> I have no idea if it makes sense, but: does the initial parse need to be
synchronous, or could you instead run the parsing in one thread, and the
rest of Emacs in another?

If the initial parse takes a while, you would have no fontification at all for the first <n> seconds, if that's what it takes to parse your buffer (font-lock wouldn't block, it'd return immediately).

Then, after that initial parse, you would switch to a blocking mode every time you need info.  That should be fast if the buffer hasn't changed too much.

If it has changed a lot, then you could revert to a non-blocking parse, while abandoning fontification for a little while.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-21 16:02                                           ` Stephen Leake
@ 2021-07-21 17:16                                             ` Stefan Monnier
  0 siblings, 0 replies; 284+ messages in thread
From: Stefan Monnier @ 2021-07-21 17:16 UTC (permalink / raw)
  To: Stephen Leake; +Cc: Clément Pit-Claudel, emacs-devel

Stephen Leake [2021-07-21 09:02:25] wrote:
> Stefan Monnier <monnier@iro.umontreal.ca> writes:
>> Not true: we just have to keep track of the list of changes (as Yuan's
>> patch does), then pass it to tree-sitter to get a tree up-to-date w.r.t
>> the current content of the buffer.
> The point is that more changes can happen while the parser is running.

Not if the "refresh" is done synchronously.


        Stefan




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-21 15:49                           ` Stephen Leake
@ 2021-07-21 19:37                             ` Eli Zaretskii
  2021-07-24  2:00                               ` Stephen Leake
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-21 19:37 UTC (permalink / raw)
  To: Stephen Leake; +Cc: casouri, monnier, emacs-devel

> From: Stephen Leake <stephen_leake@stephe-leake.org>
> Cc: casouri@gmail.com,  monnier@iro.umontreal.ca,  emacs-devel@gnu.org
> Date: Wed, 21 Jul 2021 08:49:15 -0700
> 
> > I fail to see the significance of the difference.  Surely, you could
> > hand it a block of text with changes to mean that this block replaces
> > the previous version of that block.  It might take the parser more
> > work to update the parse tree in this case, but if it's fast enough,
> > that won't be the problem.  Right?
> 
> tree-sitter doesn't store the previous text, so there's nothing to
> compare it to.

There was nothing about comparison in my text.  You tell TS that
editing replaced a block of text between A and B with block between A
and C, without revealing the fine-grained changes inside that block.
This must work, because editing could indeed do just that.

> Alternately, this would require the parser to store the
> previous text so it can compute the diff; that could be added in a
> wrapper around tree-sitter.

Presumably, TS has already solved this problem, because it needs that
for allowing the clients to communicate the changes to it.

> > That's why I'm trying to find a simpler, less wasteful strategies.
> > Since TS is very fast, we can trade some of the speed for simpler,
> > more scalable design of tracking changes.
> 
> I don't see how optimizing the change list makes it more "scalable";

Keeping too much information about each buffer is less scalable,
especially with many large buffers.

> In practice font-lock is triggered on every character typed by the user
> (Emacs is faster than people can type), so there will typically be only
> one change; nothing to optimize.

Editing doesn't include just typing one character at a time.  There's
killing, yanking, C-x i, M-/, M-\, C-M-\, smart completion, etc.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-21 16:54                                           ` Clément Pit-Claudel
@ 2021-07-21 19:43                                             ` Eli Zaretskii
  2021-07-24  2:57                                               ` Stephen Leake
  2021-07-24  3:55                                               ` Clément Pit-Claudel
  2021-07-21 21:54                                             ` Stephen Leake
  1 sibling, 2 replies; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-21 19:43 UTC (permalink / raw)
  To: Clément Pit-Claudel; +Cc: emacs-devel

> From: Clément Pit-Claudel <cpitclaudel@gmail.com>
> Date: Wed, 21 Jul 2021 12:54:16 -0400
> 
> On 7/21/21 12:29 PM, Stephen Leake wrote:
> > Yes, for both tree-sitter and wisi. wisi can take even longer if lots of
> > error correction is required (I have a time-out set at 5 seconds). But
> > that happens when the file is first opened; I doubt any user would start
> > typing that fast. I know I typically take a while to just look at the
> > text, and then navigate to the point of interest.
> 
> I'm not sure.  We've had significant complaint in Flycheck for freezing Emacs for <1s

How much "less"?  Close to 1 sec is indeed annoying, but 20 msec or so
should be bearable.

You seem to assume up front that TS (re)-parsing will take 1 sec, but
AFAIK there's no reason to assume such bad performance.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: [SPAM UNSURE] Re: How to add pseudo vector types
  2021-07-21 17:12                                             ` Clément Pit-Claudel
@ 2021-07-21 19:49                                               ` Eli Zaretskii
  2021-07-22  5:09                                                 ` Clément Pit-Claudel
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-21 19:49 UTC (permalink / raw)
  To: Clément Pit-Claudel; +Cc: emacs-devel

> From: Clément Pit-Claudel <cpitclaudel@gmail.com>
> Date: Wed, 21 Jul 2021 13:12:16 -0400
> 
> On 7/21/21 12:54 PM, Stephen Leake wrote:
> > Hmm. Perhaps you are not talking about interrupting the parse; you are
> > assuming that the parse for each change completes before the next change
> > arrives.
> 
> Neither of these.  I'm assuming that you open a file, launch a parse, batch up changes until that first parse completes, then launch a second parse, during which additional changes are batched up, then launch a third parse, etc.

But how would the "launched parse" access the buffer text if it runs
in parallel to normal editing?  We've discussed the difficulties with
that, and you seem to ignore them here?

> Any time you actually need the info (for navigating, or for fontification, or…) then you either use the last parse if it was recent enough, or (more likely) you block until you can complete a synchronous parse.

Which means the results will many times be slightly wrong, because the
parse info you use is outdated?



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-21 16:54                                           ` Clément Pit-Claudel
  2021-07-21 19:43                                             ` Eli Zaretskii
@ 2021-07-21 21:54                                             ` Stephen Leake
  2021-07-22  4:40                                               ` Clément Pit-Claudel
  1 sibling, 1 reply; 284+ messages in thread
From: Stephen Leake @ 2021-07-21 21:54 UTC (permalink / raw)
  To: Clément Pit-Claudel; +Cc: emacs-devel

Clément Pit-Claudel <cpitclaudel@gmail.com> writes:

> On 7/21/21 12:29 PM, Stephen Leake wrote:
>> Yes, for both tree-sitter and wisi. wisi can take even longer if lots of
>> error correction is required (I have a time-out set at 5 seconds). But
>> that happens when the file is first opened; I doubt any user would start
>> typing that fast. I know I typically take a while to just look at the
>> text, and then navigate to the point of interest.
>
> I'm not sure. We've had significant complaint in Flycheck for freezing
> Emacs for <1s: we have a synchronous sanity check to determine whether
> a checker can execute in a buffer (it runs a single time, and it
> should be async but I haven't gotten around to rewriting it). The
> problem is that some programs, including eslint, can take as much 1s,
> and in some bad cases 2-3 seconds, to parse their own config and
> decide if they can even run.

Ok.

> Users have complained about this delay. It might be better if they
> were able to scroll around, though — is that what happens with WISI?

wisi supports partial parse; if a buffer is larger than a user-settable
threshold, for font-lock it parses only the request region of the file,
expanded to reasonable start/end points.

So in that mode, the initial parse of even a very large buffer is fast.

However, using that for indentation is problematic, which is why I'm
implementing incremental parse.

I think continuing to support both will be useful.

> But if we have a fully synchronous TS, then that won't be possible
> either: it will be a complete Emacs freeze, no?

It should only freeze write operations on that buffer, so marking it
read-only while waiting for the parse results might be best.

-- 
-- Stephe



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-19 15:16                                       ` Yuan Fu
@ 2021-07-22  3:10                                         ` Yuan Fu
  2021-07-22  8:23                                           ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-07-22  3:10 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Clément Pit-Claudel, emacs-devel

[-- Attachment #1: Type: text/plain, Size: 579 bytes --]

Here is another patch. No big progress since I’m busy moving this week. In this patch I changed from using change hooks to directly updating the trees in edit functions. I also added some node api and tests. Should I keep posting patches, or should I create a branch in /scratch? If the latter, how do I do it?

I’m aware of the ongoing enlightening discussion on potential optimizations for tree-sitter. My plan is to first complete the api and implement some minimal structural editing/font-lock features, then we can concretely measure what needs to improve.

Yuan


[-- Attachment #2: ts.3.patch --]
[-- Type: application/octet-stream, Size: 24276 bytes --]

From fd8ad36fe5ea3b9b12e80879b7434b8bc67b53db Mon Sep 17 00:00:00 2001
From: Yuan Fu <casouri@gmail.com>
Date: Wed, 21 Jul 2021 22:43:07 -0400
Subject: [PATCH] checkpoint 3

- change_hook -> directly in edit functions
- add a need_reparse field in Lisp_TS_Parser
- more node api
- tests
---
 src/insdel.c                  |  43 ++++--
 src/tree_sitter.c             | 274 ++++++++++++++++++++++++++++------
 src/tree_sitter.h             |  18 ++-
 test/src/tree-sitter-tests.el | 106 +++++++++++++
 4 files changed, 377 insertions(+), 64 deletions(-)
 create mode 100644 test/src/tree-sitter-tests.el

diff --git a/src/insdel.c b/src/insdel.c
index 3c1e13d38b..b313c50cda 100644
--- a/src/insdel.c
+++ b/src/insdel.c
@@ -947,6 +947,10 @@ insert_1_both (const char *string,
   adjust_point (nchars, nbytes);
 
   check_markers ();
+
+#ifdef HAVE_TREE_SITTER
+      ts_record_change (PT_BYTE - nbytes, PT_BYTE - nbytes, PT_BYTE);
+#endif
 }
 \f
 /* Insert the part of the text of STRING, a Lisp object assumed to be
@@ -1078,6 +1082,10 @@ insert_from_string_1 (Lisp_Object string, ptrdiff_t pos, ptrdiff_t pos_byte,
   adjust_point (nchars, outgoing_nbytes);
 
   check_markers ();
+
+#ifdef HAVE_TREE_SITTER
+  ts_record_change (PT_BYTE - nbytes, PT_BYTE - nbytes, PT_BYTE);
+#endif
 }
 \f
 /* Insert a sequence of NCHARS chars which occupy NBYTES bytes
@@ -1145,6 +1153,10 @@ insert_from_gap (ptrdiff_t nchars, ptrdiff_t nbytes, bool text_at_gap_tail)
     adjust_point (nchars, nbytes);
 
   check_markers ();
+
+#ifdef HAVE_TREE_SITTER
+  ts_record_change (PT_BYTE - nbytes, PT_BYTE - nbytes, nbytes);
+#endif
 }
 \f
 /* Insert text from BUF, NCHARS characters starting at CHARPOS, into the
@@ -1292,6 +1304,11 @@ insert_from_buffer_1 (struct buffer *buf,
   graft_intervals_into_buffer (intervals, PT, nchars, current_buffer, inherit);
 
   adjust_point (nchars, outgoing_nbytes);
+
+#ifdef HAVE_TREE_SITTER
+  ts_record_change (PT_BYTE - outgoing_nbytes,
+		    PT_BYTE - outgoing_nbytes, PT_BYTE);
+#endif
 }
 \f
 /* Record undo information and adjust markers and position keepers for
@@ -1556,6 +1573,11 @@ replace_range (ptrdiff_t from, ptrdiff_t to, Lisp_Object new,
   if (adjust_match_data)
     update_search_regs (from, to, from + SCHARS (new));
 
+
+#ifdef HAVE_TREE_SITTER
+  ts_record_change (from_byte, to_byte, GPT_BYTE);
+#endif
+
   signal_after_change (from, nchars_del, GPT - from);
   update_compositions (from, GPT, CHECK_BORDER);
 }
@@ -1683,6 +1705,11 @@ replace_range_2 (ptrdiff_t from, ptrdiff_t from_byte,
 
   modiff_incr (&MODIFF);
   CHARS_MODIFF = MODIFF;
+
+#ifdef HAVE_TREE_SITTER
+  ts_record_change (from_byte, to_byte, from_byte + insbytes);
+#endif
+
 }
 \f
 /* Delete characters in current buffer
@@ -1893,6 +1920,10 @@ del_range_2 (ptrdiff_t from, ptrdiff_t from_byte,
 
   evaporate_overlays (from);
 
+#ifdef HAVE_TREE_SITTER
+  ts_record_change (from_byte, to_byte, from_byte);
+#endif
+
   return deletion;
 }
 
@@ -2156,11 +2187,6 @@ signal_before_change (ptrdiff_t start_int, ptrdiff_t end_int,
       run_hook (Qfirst_change_hook);
     }
 
-#ifdef HAVE_TREE_SITTER
-  /* FIXME: Is this the best place?  */
-  ts_before_change (start_int, end_int);
-#endif
-
   /* Now run the before-change-functions if any.  */
   if (!NILP (Vbefore_change_functions))
     {
@@ -2214,13 +2240,6 @@ signal_after_change (ptrdiff_t charpos, ptrdiff_t lendel, ptrdiff_t lenins)
   if (inhibit_modification_hooks)
     return;
 
-#ifdef HAVE_TREE_SITTER
-  /* We disrespect combine-after-change, because if we don't record
-     this change, the information that we need (the end byte position
-     of the change) will be lost.  */
-  ts_after_change (charpos, lendel, lenins);
-#endif
-
   /* If we are deferring calls to the after-change functions
      and there are no before-change functions,
      just record the args that we were going to use.  */
diff --git a/src/tree_sitter.c b/src/tree_sitter.c
index 7d1225161c..a6a8912c84 100644
--- a/src/tree_sitter.c
+++ b/src/tree_sitter.c
@@ -32,49 +32,52 @@ Copyright (C) 2021 Free Software Foundation, Inc.
 #include "coding.h"
 #include "tree_sitter.h"
 
-/* parser.h defines a macro ADVANCE that conflicts with alloc.c.   */
+/* parser.h defines a macro ADVANCE that conflicts with alloc.c.  */
 #include <tree_sitter/parser.h>
 
-/* Record the byte position of the end of the (to-be) changed text.
-We have to record it now, because by the time we get to after-change
-hook, the _byte_ position of the end is lost.  */
-void
-ts_before_change (ptrdiff_t start_int, ptrdiff_t end_int)
+DEFUN ("tree-sitter-parser-p",
+       Ftree_sitter_parser_p, Stree_sitter_parser_p, 1, 1, 0,
+       doc: /* Return t if OBJECT is a tree-sitter parser.  */)
+  (Lisp_Object object)
 {
-  /* Iterate through each parser in 'tree-sitter-parser-list' and
-     record the byte position.  There could be better ways to record
-     it than storing the same position in every parser, but this is
-     the most fool-proof way, and I expect a buffer to have only one
-     parser most of the time anyway. */
-  ptrdiff_t beg_byte = CHAR_TO_BYTE (start_int);
-  ptrdiff_t old_end_byte = CHAR_TO_BYTE (end_int);
-  Lisp_Object parser_list = Fsymbol_value (Qtree_sitter_parser_list);
-  while (!NILP (parser_list))
-    {
-      Lisp_Object lisp_parser = Fcar (parser_list);
-      XTS_PARSER (lisp_parser)->edit.start_byte = beg_byte;
-      XTS_PARSER (lisp_parser)->edit.old_end_byte = old_end_byte;
-      parser_list = Fcdr (parser_list);
-    }
+  if (TS_PARSERP (object))
+    return Qt;
+  else
+    return Qnil;
+}
+
+DEFUN ("tree-sitter-node-p",
+       Ftree_sitter_node_p, Stree_sitter_node_p, 1, 1, 0,
+       doc: /* Return t if OBJECT is a tree-sitter node.  */)
+  (Lisp_Object object)
+{
+  if (TS_NODEP (object))
+    return Qt;
+  else
+    return Qnil;
 }
 
 /* Update each parser's tree after the user made an edit.  This
 function does not parse the buffer and only updates the tree. (So it
 should be very fast.)  */
 void
-ts_after_change (ptrdiff_t charpos, ptrdiff_t lendel, ptrdiff_t lenins)
+ts_record_change (ptrdiff_t start_byte, ptrdiff_t old_end_byte,
+		  ptrdiff_t new_end_byte)
 {
-  ptrdiff_t new_end_byte = CHAR_TO_BYTE (charpos + lenins);
   Lisp_Object parser_list = Fsymbol_value (Qtree_sitter_parser_list);
+  TSPoint dummy_point = {0, 0};
+  TSInputEdit edit = {start_byte, old_end_byte, new_end_byte,
+		      dummy_point, dummy_point, dummy_point};
   while (!NILP (parser_list))
     {
       Lisp_Object lisp_parser = Fcar (parser_list);
       TSTree *tree = XTS_PARSER (lisp_parser)->tree;
-      XTS_PARSER (lisp_parser)->edit.new_end_byte = new_end_byte;
       if (tree != NULL)
-	  ts_tree_edit (tree, &XTS_PARSER (lisp_parser)->edit);
+	ts_tree_edit (tree, &edit);
+      XTS_PARSER (lisp_parser)->need_reparse = true;
       parser_list = Fcdr (parser_list);
     }
+
 }
 
 /* Parse the buffer.  We don't parse until we have to. When we have
@@ -82,11 +85,15 @@ ts_after_change (ptrdiff_t charpos, ptrdiff_t lendel, ptrdiff_t lenins)
 void
 ts_ensure_parsed (Lisp_Object parser)
 {
+  if (!XTS_PARSER (parser)->need_reparse)
+    return;
   TSParser *ts_parser = XTS_PARSER (parser)->parser;
   TSTree *tree = XTS_PARSER(parser)->tree;
   TSInput input = XTS_PARSER (parser)->input;
   TSTree *new_tree = ts_parser_parse(ts_parser, tree, input);
+  ts_tree_delete (tree);
   XTS_PARSER (parser)->tree = new_tree;
+  XTS_PARSER (parser)->need_reparse = false;
 }
 
 /* This is the read function provided to tree-sitter to read from a
@@ -96,33 +103,30 @@ ts_ensure_parsed (Lisp_Object parser)
 ts_read_buffer (void *buffer, uint32_t byte_index,
 		TSPoint position, uint32_t *bytes_read)
 {
-  if (! BUFFER_LIVE_P ((struct buffer *) buffer))
+  if (!BUFFER_LIVE_P ((struct buffer *) buffer))
     error ("BUFFER is not live");
 
   ptrdiff_t byte_pos = byte_index + 1;
 
-  // FIXME: Add some boundary checks?
-  /* I believe we can get away with only setting current-buffer
-     and not actually switching to it, like what we did in
-     'make_gap_1'.  */
-  struct buffer *old_buffer = current_buffer;
-  current_buffer = (struct buffer *) buffer;
-
-  /* Read one character.  */
+  /* Read one character.  Tree-sitter wants us to set bytes_read to 0
+     if it reads to the end of buffer.  It doesn't say what it wants
+     for the return value in that case, so we just give it an empty
+     string.  */
   char *beg;
   int len;
-  if (byte_pos >= Z_BYTE)
+  // TODO BUF_ZV_BYTE?
+  if (byte_pos >= BUF_Z_BYTE ((struct buffer *) buffer))
     {
       beg = "";
       len = 0;
     }
   else
     {
-      beg = (char *) BYTE_POS_ADDR (byte_pos);
+      beg = (char *) BUF_BYTE_ADDRESS (buffer, byte_pos);
       len = next_char_len(byte_pos);
     }
   *bytes_read = (uint32_t) len;
-  current_buffer = old_buffer;
+
   return beg;
 }
 
@@ -137,9 +141,7 @@ make_ts_parser (struct buffer *buffer, TSParser *parser, TSTree *tree)
   lisp_parser->tree = tree;
   TSInput input = {buffer, ts_read_buffer, TSInputEncodingUTF8};
   lisp_parser->input = input;
-  TSPoint dummy_point = {0, 0};
-  TSInputEdit edit = {0, 0, 0, dummy_point, dummy_point, dummy_point};
-  lisp_parser->edit = edit;
+  lisp_parser->need_reparse = true;
   return make_lisp_ptr (lisp_parser, Lisp_Vectorlike);
 }
 
@@ -192,6 +194,7 @@ DEFUN ("tree-sitter-parser-root-node",
        doc: /* Return the root node of PARSER.  */)
   (Lisp_Object parser)
 {
+  CHECK_TS_PARSER (parser);
   ts_ensure_parsed(parser);
   TSNode root_node = ts_tree_root_node (XTS_PARSER (parser)->tree);
   return make_ts_node (parser, root_node);
@@ -229,11 +232,29 @@ DEFUN ("tree-sitter-parse", Ftree_sitter_parse, Stree_sitter_parse,
   return lisp_node;
 }
 
+/* Below this point are uninteresting mechanical translations of
+   tree-sitter API.  */
+
+/* Node functions.  */
+
+DEFUN ("tree-sitter-node-type",
+       Ftree_sitter_node_type, Stree_sitter_node_type, 1, 1, 0,
+       doc: /* Return the NODE's type as a symbol.  */)
+  (Lisp_Object node)
+{
+  CHECK_TS_NODE (node);
+  TSNode ts_node = XTS_NODE (node)->node;
+  const char *type = ts_node_type(ts_node);
+  return intern_c_string (type);
+}
+
+
 DEFUN ("tree-sitter-node-string",
        Ftree_sitter_node_string, Stree_sitter_node_string, 1, 1, 0,
        doc: /* Return the string representation of NODE.  */)
   (Lisp_Object node)
 {
+  CHECK_TS_NODE (node);
   TSNode ts_node = XTS_NODE (node)->node;
   char *string = ts_node_string(ts_node);
   return make_string(string, strlen (string));
@@ -242,29 +263,125 @@ DEFUN ("tree-sitter-node-string",
 DEFUN ("tree-sitter-node-parent",
        Ftree_sitter_node_parent, Stree_sitter_node_parent, 1, 1, 0,
        doc: /* Return the immediate parent of NODE.
-Return nil if we couldn't find any.  */)
+Return nil if there isn't any.  */)
   (Lisp_Object node)
 {
+  CHECK_TS_NODE (node);
   TSNode ts_node = XTS_NODE (node)->node;
-  TSNode parent = ts_node_parent(ts_node);
+  TSNode parent = ts_node_parent (ts_node);
 
   if (ts_node_is_null(parent))
     return Qnil;
 
-  return make_ts_node(XTS_NODE (node)->parser, parent);
+  return make_ts_node (XTS_NODE (node)->parser, parent);
 }
 
 DEFUN ("tree-sitter-node-child",
-       Ftree_sitter_node_child, Stree_sitter_node_child, 2, 2, 0,
+       Ftree_sitter_node_child, Stree_sitter_node_child, 2, 3, 0,
        doc: /* Return the Nth child of NODE.
-Return nil if we couldn't find any.  */)
+Return nil if there isn't any.  If NAMED is non-nil, look for named
+child only.  NAMED defaults to nil.  */)
+  (Lisp_Object node, Lisp_Object n, Lisp_Object named)
+{
+  CHECK_TS_NODE (node);
+  CHECK_INTEGER (n);
+  EMACS_INT idx = XFIXNUM (n);
+  TSNode ts_node = XTS_NODE (node)->node;
+  TSNode child;
+  if (NILP (named))
+    child = ts_node_child (ts_node, (uint32_t) idx);
+  else
+    child = ts_node_named_child (ts_node, (uint32_t) idx);
+
+  if (ts_node_is_null(child))
+    return Qnil;
+
+  return make_ts_node(XTS_NODE (node)->parser, child);
+}
+
+DEFUN ("tree-sitter-node-check",
+       Ftree_sitter_node_check, Stree_sitter_node_check, 2, 2, 0,
+       doc: /* Return non-nil if NODE is in condition COND, nil otherwise.
+
+COND could be 'named, 'missing, 'extra, 'has-error.  Named nodes
+correspond to named rules in the grammar, whereas "anonymous" nodes
+correspond to string literals in the grammar.
+
+Missing nodes are inserted by the parser in order to recover from
+certain kinds of syntax errors, i.e., should be there but not there.
+
+Extra nodes represent things like comments, which are not required the
+grammar, but can appear anywhere.
+
+A node "has error" if itself is a syntax error or contains any syntax
+errors.  */)
+  (Lisp_Object node, Lisp_Object cond)
+{
+  CHECK_TS_NODE (node);
+  CHECK_SYMBOL (cond);
+  TSNode ts_node = XTS_NODE (node)->node;
+  bool result;
+  if (EQ (cond, Qnamed))
+    result = ts_node_is_named (ts_node);
+  else if (EQ (cond, Qmissing))
+    result = ts_node_is_missing (ts_node);
+  else if (EQ (cond, Qextra))
+    result = ts_node_is_extra (ts_node);
+  else if (EQ (cond, Qhas_error))
+    result = ts_node_has_error (ts_node);
+  else
+    signal_error ("Expecting one of four symbols, see docstring", cond);
+  return result ? Qt : Qnil;
+}
+
+DEFUN ("tree-sitter-node-field-name-for-child",
+       Ftree_sitter_node_field_name_for_child,
+       Stree_sitter_node_field_name_for_child, 2, 2, 0,
+       doc: /* Return the field name of the Nth child of NODE.
+Return nil if there isn't any child or no field is found.  */)
   (Lisp_Object node, Lisp_Object n)
 {
   CHECK_INTEGER (n);
   EMACS_INT idx = XFIXNUM (n);
   TSNode ts_node = XTS_NODE (node)->node;
-  // FIXME: Is this cast ok?
-  TSNode child = ts_node_child(ts_node, (uint32_t) idx);
+  const char *name
+    = ts_node_field_name_for_child (ts_node, (uint32_t) idx);
+
+  if (name == NULL)
+    return Qnil;
+
+  return make_string (name, strlen (name));
+}
+
+DEFUN ("tree-sitter-node-child-count",
+       Ftree_sitter_node_child_count,
+       Stree_sitter_node_child_count, 1, 2, 0,
+       doc: /* Return the number of children of NODE.
+If NAMED is non-nil, count named child only.  NAMED defaults to
+nil.  */)
+  (Lisp_Object node, Lisp_Object named)
+{
+  TSNode ts_node = XTS_NODE (node)->node;
+  uint32_t count;
+  if (NILP (named))
+    count = ts_node_child_count (ts_node);
+  else
+    count = ts_node_named_child_count (ts_node);
+  return make_fixnum (count);
+}
+
+DEFUN ("tree-sitter-node-child-by-field-name",
+       Ftree_sitter_node_child_by_field_name,
+       Stree_sitter_node_child_by_field_name, 2, 2, 0,
+       doc: /* Return the child of NODE with field name NAME.
+Return nil if there isn't any.  */)
+  (Lisp_Object node, Lisp_Object name)
+{
+  CHECK_STRING (name);
+  char *name_str = SSDATA (name);
+  TSNode ts_node = XTS_NODE (node)->node;
+  TSNode child
+    = ts_node_child_by_field_name (ts_node, name_str, strlen (name_str));
 
   if (ts_node_is_null(child))
     return Qnil;
@@ -272,10 +389,62 @@ DEFUN ("tree-sitter-node-child",
   return make_ts_node(XTS_NODE (node)->parser, child);
 }
 
+DEFUN ("tree-sitter-node-next-sibling",
+       Ftree_sitter_node_next_sibling,
+       Stree_sitter_node_next_sibling, 1, 2, 0,
+       doc: /* Return the next sibling of NODE.
+Return nil if there isn't any.  If NAMED is non-nil, look for named
+child only.  NAMED defaults to nil.  */)
+  (Lisp_Object node, Lisp_Object named)
+{
+  TSNode ts_node = XTS_NODE (node)->node;
+  TSNode sibling;
+  if (NILP (named))
+    sibling = ts_node_next_sibling (ts_node);
+  else
+    sibling = ts_node_next_named_sibling (ts_node);
+
+  if (ts_node_is_null(sibling))
+    return Qnil;
+
+  return make_ts_node(XTS_NODE (node)->parser, sibling);
+}
+
+DEFUN ("tree-sitter-node-prev-sibling",
+       Ftree_sitter_node_prev_sibling,
+       Stree_sitter_node_prev_sibling, 1, 2, 0,
+       doc: /* Return the previous sibling of NODE.
+Return nil if there isn't any.  If NAMED is non-nil, look for named
+child only.  NAMED defaults to nil.  */)
+  (Lisp_Object node, Lisp_Object named)
+{
+  TSNode ts_node = XTS_NODE (node)->node;
+  TSNode sibling;
+
+  if (NILP (named))
+    sibling = ts_node_prev_sibling (ts_node);
+  else
+    sibling = ts_node_prev_named_sibling (ts_node);
+
+  if (ts_node_is_null(sibling))
+    return Qnil;
+
+  return make_ts_node(XTS_NODE (node)->parser, sibling);
+}
+
+/* Query functions */
+
 /* Initialize the tree-sitter routines.  */
 void
 syms_of_tree_sitter (void)
 {
+  DEFSYM (Qtree_sitter_parser_p, "tree-sitter-parser-p");
+  DEFSYM (Qtree_sitter_node_p, "tree-sitter-node-p");
+  DEFSYM (Qnamed, "named");
+  DEFSYM (Qmissing, "missing");
+  DEFSYM (Qextra, "extra");
+  DEFSYM (Qhas_error, "has-error");
+
   DEFSYM (Qtree_sitter_parser_list, "tree-sitter-parser-list");
   DEFVAR_LISP ("ts-parser-list", Vtree_sitter_parser_list,
 		     doc: /* A list of tree-sitter parsers.
@@ -284,11 +453,20 @@ syms_of_tree_sitter (void)
   Vtree_sitter_parser_list = Qnil;
   Fmake_variable_buffer_local (Qtree_sitter_parser_list);
 
-
+  defsubr (&Stree_sitter_parser_p);
+  defsubr (&Stree_sitter_node_p);
   defsubr (&Stree_sitter_create_parser);
   defsubr (&Stree_sitter_parser_root_node);
   defsubr (&Stree_sitter_parse);
+
+  defsubr (&Stree_sitter_node_type);
   defsubr (&Stree_sitter_node_string);
   defsubr (&Stree_sitter_node_parent);
   defsubr (&Stree_sitter_node_child);
+  defsubr (&Stree_sitter_node_check);
+  defsubr (&Stree_sitter_node_field_name_for_child);
+  defsubr (&Stree_sitter_node_child_count);
+  defsubr (&Stree_sitter_node_child_by_field_name);
+  defsubr (&Stree_sitter_node_next_sibling);
+  defsubr (&Stree_sitter_node_prev_sibling);
 }
diff --git a/src/tree_sitter.h b/src/tree_sitter.h
index 0606f336cc..a7e2a2d670 100644
--- a/src/tree_sitter.h
+++ b/src/tree_sitter.h
@@ -37,7 +37,7 @@ #define EMACS_TREE_SITTER_H
   TSParser *parser;
   TSTree *tree;
   TSInput input;
-  TSInputEdit edit;
+  bool need_reparse;
 };
 
 /* A wrapper around a tree-sitter node.  */
@@ -78,11 +78,21 @@ XTS_NODE (Lisp_Object a)
   return XUNTAG (a, Lisp_Vectorlike, struct Lisp_TS_Node);
 }
 
-void
-ts_before_change (ptrdiff_t charpos, ptrdiff_t lendel);
+INLINE void
+CHECK_TS_PARSER (Lisp_Object parser)
+{
+  CHECK_TYPE (TS_PARSERP (parser), Qtree_sitter_parser_p, parser);
+}
+
+INLINE void
+CHECK_TS_NODE (Lisp_Object node)
+{
+  CHECK_TYPE (TS_NODEP (node), Qtree_sitter_node_p, node);
+}
 
 void
-ts_after_change (ptrdiff_t charpos, ptrdiff_t lendel, ptrdiff_t lenins);
+ts_record_change (ptrdiff_t start_byte, ptrdiff_t old_end_byte,
+		  ptrdiff_t new_end_byte);
 
 Lisp_Object
 make_ts_parser (struct buffer *buffer, TSParser *parser, TSTree *tree);
diff --git a/test/src/tree-sitter-tests.el b/test/src/tree-sitter-tests.el
new file mode 100644
index 0000000000..cb1c464d3a
--- /dev/null
+++ b/test/src/tree-sitter-tests.el
@@ -0,0 +1,106 @@
+;;; tree-sitter-tests.el --- tests for src/tree-sitter.c         -*- lexical-binding: t; -*-
+
+;; Copyright (C) 2021 Free Software Foundation, Inc.
+
+;; This file is part of GNU Emacs.
+
+;; GNU Emacs is free software: you can redistribute it and/or modify
+;; it under the terms of the GNU General Public License as published by
+;; the Free Software Foundation, either version 3 of the License, or
+;; (at your option) any later version.
+
+;; GNU Emacs is distributed in the hope that it will be useful,
+;; but WITHOUT ANY WARRANTY; without even the implied warranty of
+;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+;; GNU General Public License for more details.
+
+;; You should have received a copy of the GNU General Public License
+;; along with GNU Emacs.  If not, see <https://www.gnu.org/licenses/>.
+
+;;; Code:
+
+(require 'ert)
+(require 'tree-sitter-json)
+
+(ert-deftest tree-sitter-basic-parsing ()
+  "Test basic parsing routines."
+  (with-temp-buffer
+    (let ((parser (tree-sitter-create-parser
+                   (current-buffer) (tree-sitter-json))))
+      (should
+       (eq parser (car tree-sitter-parser-list)))
+      (should
+       (equal (tree-sitter-node-string
+               (tree-sitter-parser-root-node parser))
+              "(ERROR)"))
+
+      (insert "[1,2,3]")
+      (should
+       (equal (tree-sitter-node-string
+               (tree-sitter-parser-root-node parser))
+              "(document (array (number) (number) (number)))"))
+
+      (goto-char (point-min))
+      (forward-char 3)
+      (insert "{\"name\": \"Bob\"},")
+      (should
+       (equal
+        (tree-sitter-node-string
+         (tree-sitter-parser-root-node parser))
+        "(document (array (number) (object (pair key: (string (string_content)) value: (string (string_content)))) (number) (number)))")))))
+
+(ert-deftest tree-sitter-node-api ()
+  "Tests for node API."
+  (with-temp-buffer
+    (insert "[1,2,{\"name\": \"Bob\"},3]")
+    (let (parser root-node doc-node object-node pair-node)
+      (setq parser (tree-sitter-create-parser
+                    (current-buffer) (tree-sitter-json)))
+      (setq root-node (tree-sitter-parser-root-node
+                       parser))
+      ;; `tree-sitter-node-type'.
+      (should (eq 'document (tree-sitter-node-type root-node)))
+      ;; `tree-sitter-node-check'.
+      (should (eq t (tree-sitter-node-check root-node 'named)))
+      (should (eq nil (tree-sitter-node-check root-node 'missing)))
+      (should (eq nil (tree-sitter-node-check root-node 'extra)))
+      (should (eq nil (tree-sitter-node-check root-node 'has-error)))
+      ;; `tree-sitter-node-child'.
+      (setq doc-node (tree-sitter-node-child root-node 0))
+      (should (eq 'array (tree-sitter-node-type doc-node)))
+      (should (equal (tree-sitter-node-string doc-node)
+                     "(array (number) (number) (object (pair key: (string (string_content)) value: (string (string_content)))) (number))"))
+      ;; `tree-sitter-node-child-count'.
+      (should (eql 9 (tree-sitter-node-child-count doc-node)))
+      (should (eql 4 (tree-sitter-node-child-count doc-node t)))
+      ;; `tree-sitter-node-field-name-for-child'.
+      (setq object-node (tree-sitter-node-child doc-node 2 t))
+      (setq pair-node (tree-sitter-node-child object-node 0 t))
+      (should (eq 'object (tree-sitter-node-type object-node)))
+      (should (eq 'pair (tree-sitter-node-type pair-node)))
+      (should (equal "key"
+                     (tree-sitter-node-field-name-for-child
+                      pair-node 0)))
+      ;; `tree-sitter-node-child-by-field-name'.
+      (should (equal "(string (string_content))"
+                     (tree-sitter-node-string
+                      (tree-sitter-node-child-by-field-name
+                       pair-node "key"))))
+      ;; `tree-sitter-node-next-sibling'.
+      (should (equal "(number)"
+                     (tree-sitter-node-string
+                      (tree-sitter-node-next-sibling object-node t))))
+      (should (equal "(\",\")"
+                     (tree-sitter-node-string
+                      (tree-sitter-node-next-sibling object-node))))
+      ;; `tree-sitter-node-prev-sibling'.
+      (should (equal "(number)"
+                     (tree-sitter-node-string
+                      (tree-sitter-node-prev-sibling object-node t))))
+      (should (equal "(\",\")"
+                     (tree-sitter-node-string
+                      (tree-sitter-node-prev-sibling object-node))))
+      )))
+
+(provide 'tree-sitter-tests)
+;;; tree-sitter-tests.el ends here
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-21 21:54                                             ` Stephen Leake
@ 2021-07-22  4:40                                               ` Clément Pit-Claudel
  0 siblings, 0 replies; 284+ messages in thread
From: Clément Pit-Claudel @ 2021-07-22  4:40 UTC (permalink / raw)
  To: emacs-devel

On 7/21/21 5:54 PM, Stephen Leake wrote:
> It should only freeze write operations on that buffer, so marking it
> read-only while waiting for the parse results might be best.

Yes, I expect that would be much better than what we have.

Thanks for your work on wisi, by the way!



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-21 13:51                                             ` Eli Zaretskii
@ 2021-07-22  4:59                                               ` Clément Pit-Claudel
  2021-07-22  6:38                                                 ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Clément Pit-Claudel @ 2021-07-22  4:59 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

On 7/21/21 9:51 AM, Eli Zaretskii wrote:
>> From: Clément Pit-Claudel <cpitclaudel@gmail.com>
>> Date: Wed, 21 Jul 2021 09:38:31 -0400
>> Cc: emacs-devel@gnu.org
>>
>>    <valuable text><small bit of the gap because the gap moved while TS was scanning><more valuable data>
>>
>> and then, knowing that the gap had moved, re-feeding it just the area that corresponds to the places around the boundaries of the gap might yield a speedup.
> 
> You are assuming that TS will be able to process both <valuable text>
> and <more valuable data>, even though it eats the garbage in the gap?
> That isn't guaranteed, due to possibly invalid byte sequences in the
> gap.

Yes, that's fair.

>> Alternatively, keeping the list of changes allows us to maintain a copy of the buffer that TS uses for scanning, with updates delayed until TS is done scanning.
> 
> Having a copy for each buffer that needs parsing doesn't scale.

Because of time, or because of memory?  

I though we assumed memory was a non-issue, because tree-sitter's data structures seem to require *a lot* more space than the text of the underlying buffer (in 2018 the main dev said "syntax trees still use over 10x as much memory as the size of the source file.").

Copying time can be an issue, for sure, but memcpy() is fast these days ^^




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: [SPAM UNSURE] Re: How to add pseudo vector types
  2021-07-21 19:49                                               ` Eli Zaretskii
@ 2021-07-22  5:09                                                 ` Clément Pit-Claudel
  2021-07-22  6:44                                                   ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Clément Pit-Claudel @ 2021-07-22  5:09 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

On 7/21/21 3:49 PM, Eli Zaretskii wrote:
>> From: Clément Pit-Claudel <cpitclaudel@gmail.com>
>> Date: Wed, 21 Jul 2021 13:12:16 -0400
>>
>> On 7/21/21 12:54 PM, Stephen Leake wrote:
>>> Hmm. Perhaps you are not talking about interrupting the parse; you are
>>> assuming that the parse for each change completes before the next change
>>> arrives.
>>
>> Neither of these.  I'm assuming that you open a file, launch a parse, batch up changes until that first parse completes, then launch a second parse, during which additional changes are batched up, then launch a third parse, etc.
> 
> But how would the "launched parse" access the buffer text if it runs
> in parallel to normal editing?  We've discussed the difficulties with
> that, and you seem to ignore them here?

Lots of magic handwaving: IOW, I don't have a solution, just a general hope that minimal synchronization and decent error recovery would help (for example, maybe it's enough to synchronize only when TS requires a chunk of memory).  But for the discussion above, Stefan's copying solution works fine.

>> Any time you actually need the info (for navigating, or for fontification, or…) then you either use the last parse if it was recent enough, or (more likely) you block until you can complete a synchronous parse.
> 
> Which means the results will many times be slightly wrong, because the
> parse info you use is outdated?

Maybe.  In practice if the delay between requesting the info and getting it is perceptible, then displaying outdated info is better than freezing until you get up-to-date info, no?  Either you're getting info so fast that the user doesn't realize that you're outdated by 1ms; or you're getting info so slowly that the user realizes that you're running one second behind — but it's much better than freezing for one second.

Less relevant details below:

This is a problem we have all the time with Flycheck btw: you send the text of the buffer to a compiler, it returns 3 seconds later, and you want to display errors as reported by the compiler.  By the time we get errors to display, the locations they come with are outdated.  We don't have a good solution.

Visual Studio in the old days had a really beautiful solution for this.  There was a (basically) free API you could call to snapshot a buffer at a point in time; then there was a function that translated positions in that snapshot to position in the current buffer (think of it as magically putting a marker into the past buffer when the snapshot was taken, and then querying its current position).  So positions returned by the compiler or any other tools were still relevant if they referred to a three-seconds old buffer, since you could translate them to the current buffer.

Clément.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-22  4:59                                               ` Clément Pit-Claudel
@ 2021-07-22  6:38                                                 ` Eli Zaretskii
  0 siblings, 0 replies; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-22  6:38 UTC (permalink / raw)
  To: Clément Pit-Claudel; +Cc: emacs-devel

> Cc: emacs-devel@gnu.org
> From: Clément Pit-Claudel <cpitclaudel@gmail.com>
> Date: Thu, 22 Jul 2021 00:59:31 -0400
> 
> >> Alternatively, keeping the list of changes allows us to maintain a copy of the buffer that TS uses for scanning, with updates delayed until TS is done scanning.
> > 
> > Having a copy for each buffer that needs parsing doesn't scale.
> 
> Because of time, or because of memory?  

Memory, mostly.

> I though we assumed memory was a non-issue, because tree-sitter's data structures seem to require *a lot* more space than the text of the underlying buffer (in 2018 the main dev said "syntax trees still use over 10x as much memory as the size of the source file.").

You are talking about _adding_ to that another copy of the buffer's
text, which could be many megabytes.  And your proposal means we will
have such copies for many buffers.

As for the TS memory requirements, if they really need 1GB for a 100MB
file (I doubt that), then TS is probably not a good candidate for
Emacs.

> Copying time can be an issue, for sure, but memcpy() is fast these days ^^

You forget the time needed to allocate the memory for the copy, that
could be orders of magnitude slower for large buffers, especially if
there's a lot of memory pressure on the OS.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: [SPAM UNSURE] Re: How to add pseudo vector types
  2021-07-22  5:09                                                 ` Clément Pit-Claudel
@ 2021-07-22  6:44                                                   ` Eli Zaretskii
  2021-07-22 14:43                                                     ` Clément Pit-Claudel
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-22  6:44 UTC (permalink / raw)
  To: Clément Pit-Claudel; +Cc: emacs-devel

> Cc: emacs-devel@gnu.org
> From: Clément Pit-Claudel <cpitclaudel@gmail.com>
> Date: Thu, 22 Jul 2021 01:09:13 -0400
> 
> > But how would the "launched parse" access the buffer text if it runs
> > in parallel to normal editing?  We've discussed the difficulties with
> > that, and you seem to ignore them here?
> 
> Lots of magic handwaving:

Hand-waving always works, of course.

> > Which means the results will many times be slightly wrong, because the
> > parse info you use is outdated?
> 
> Maybe.  In practice if the delay between requesting the info and getting it is perceptible, then displaying outdated info is better than freezing until you get up-to-date info, no?  Either you're getting info so fast that the user doesn't realize that you're outdated by 1ms; or you're getting info so slowly that the user realizes that you're running one second behind — but it's much better than freezing for one second.

The time doesn't matter here: the amount of changes does.

IME, display based on outdated information is NOT okay.  The
discrepancy between the actual stuff on the screen and its
fontification based on outdated buffer context could be quite
annoying, for example.

> This is a problem we have all the time with Flycheck btw: you send the text of the buffer to a compiler, it returns 3 seconds later, and you want to display errors as reported by the compiler.  By the time we get errors to display, the locations they come with are outdated.  We don't have a good solution.
> 
> Visual Studio in the old days had a really beautiful solution for this.  There was a (basically) free API you could call to snapshot a buffer at a point in time; then there was a function that translated positions in that snapshot to position in the current buffer (think of it as magically putting a marker into the past buffer when the snapshot was taken, and then querying its current position).  So positions returned by the compiler or any other tools were still relevant if they referred to a three-seconds old buffer, since you could translate them to the current buffer.

We can do that as well.  It's again something very similar to the
undo-list info we already collect.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-22  3:10                                         ` Yuan Fu
@ 2021-07-22  8:23                                           ` Eli Zaretskii
  2021-07-22 13:47                                             ` Yuan Fu
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-22  8:23 UTC (permalink / raw)
  To: Yuan Fu; +Cc: cpitclaudel, monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Wed, 21 Jul 2021 23:10:14 -0400
> Cc: Clément Pit-Claudel <cpitclaudel@gmail.com>,
>  emacs-devel <emacs-devel@gnu.org>
> 
> Should I keep posting patches, or should I create a branch in /scratch?

The latter, I think.

> If the latter, how do I do it?

You need write access to the Emacs repository.

> @@ -96,33 +103,30 @@ ts_ensure_parsed (Lisp_Object parser)
>  ts_read_buffer (void *buffer, uint32_t byte_index,
>  		TSPoint position, uint32_t *bytes_read)
>  {
> -  if (! BUFFER_LIVE_P ((struct buffer *) buffer))
> +  if (!BUFFER_LIVE_P ((struct buffer *) buffer))
>      error ("BUFFER is not live");

Is it really TRT to signal an error here?  This is not code that would
run from a user command, so signaling an error is not necessarily the
useful response to this situation.  Why not simply return without
doing anything?

> +  // TODO BUF_ZV_BYTE?

Do you want to discuss this?  I'd prefer to have it the other way
around: use BUF_ZV_BYTE by default.  The callers could widen the
buffer if they needed to access outside of the narrowing.

>    else
>      {
> -      beg = (char *) BYTE_POS_ADDR (byte_pos);
> +      beg = (char *) BUF_BYTE_ADDRESS (buffer, byte_pos);
>        len = next_char_len(byte_pos);

The last line is incorrect, as it assumes the current buffer.  You
actually don't need that function, it's enough to use
BYTES_BY_CHAR_HEAD on the address in 'beg'.

>    *bytes_read = (uint32_t) len;

Is using uint32_t the restriction of tree-sitter?  Doesn't it support
reading more than 2 gigabytes?

> +DEFUN ("tree-sitter-node-type",
> +       Ftree_sitter_node_type, Stree_sitter_node_type, 1, 1, 0,
> +       doc: /* Return the NODE's type as a symbol.  */)
> +  (Lisp_Object node)
> +{
> +  CHECK_TS_NODE (node);
> +  TSNode ts_node = XTS_NODE (node)->node;
> +  const char *type = ts_node_type(ts_node);
> +  return intern_c_string (type);

Why do we need to intern the string each time? can't we store the
interned symbol there, instead of a C string, in the first place?

Thanks.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-22  8:23                                           ` Eli Zaretskii
@ 2021-07-22 13:47                                             ` Yuan Fu
  2021-07-22 14:11                                               ` Óscar Fuentes
                                                                 ` (2 more replies)
  0 siblings, 3 replies; 284+ messages in thread
From: Yuan Fu @ 2021-07-22 13:47 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Clément Pit-Claudel, Stefan Monnier, emacs-devel

[-- Attachment #1: Type: text/plain, Size: 3092 bytes --]


> 
>> +  // TODO BUF_ZV_BYTE?
> 
> Do you want to discuss this?  I'd prefer to have it the other way
> around: use BUF_ZV_BYTE by default.  The callers could widen the
> buffer if they needed to access outside of the narrowing.

Yes, I meant to discuss this. The problem with respecting narrowing is that, a user can freely narrow and widen arbitrarily, and Emacs needs to translate them into insertion & deletion of the buffer text for tree-sitter, every time a user narrows or widens the buffer. Plus, if tree-sitter respects narrowing, it could happen where a user narrows the buffer, the font-locking changes and is not correct anymore. Maybe that’s not the user want. Also, if someone narrows and widens often, maybe narrow to a function for better focus, tree-sitter needs to constantly re-parse most of the buffer. These are not significant disadvantages, but what do we get from respecting narrowing that justifies code complexity and these small annoyances?

>>   *bytes_read = (uint32_t) len;
> 
> Is using uint32_t the restriction of tree-sitter?  Doesn't it support
> reading more than 2 gigabytes?

I’m not sure why it asks for uint32 specifically, but that’s what it asks for its api. I don’t think you are supposed to use tree-sitter on files of size of gigabytes, because the author mentioned that tree-sitter uses over 10x as much memory as the size of the source file [1]. On files larger than a couple of megabytes, I think we better turn off tree-sitter. Normally those files are not regular source files, anyway, and we don’t need a parse tree for a log.

That leads to another point. I suspect the memory limit will come before the speed limit, i.e., as the file size increases, the memory consumption will become unacceptable before the speed does. So it is possible that we want to outright disable tree-sitter for larger files, then we don’t need to do much to improve the responsiveness of tree-sitter on large files. And we might want to delete the parse tree if a buffer has been idle for a while. Of course, that’s just my superstition, we’ll see once we can measure the performance.

> 
>> +DEFUN ("tree-sitter-node-type",
>> +       Ftree_sitter_node_type, Stree_sitter_node_type, 1, 1, 0,
>> +       doc: /* Return the NODE's type as a symbol.  */)
>> +  (Lisp_Object node)
>> +{
>> +  CHECK_TS_NODE (node);
>> +  TSNode ts_node = XTS_NODE (node)->node;
>> +  const char *type = ts_node_type(ts_node);
>> +  return intern_c_string (type);
> 
> Why do we need to intern the string each time? can't we store the
> interned symbol there, instead of a C string, in the first place?

I’m not sure what do you mean by “store the interned symbol there”, where do I store the interned symbol? (BTW, If you see something wrong, that’s probably because I don’t know the right way to do it, and grepping only got me that far.)

[1]: https://github.com/tree-sitter/tree-sitter/issues/222#issuecomment-435987441 <https://github.com/tree-sitter/tree-sitter/issues/222#issuecomment-435987441>

Thanks,
Yuan


[-- Attachment #2: Type: text/html, Size: 4440 bytes --]

^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-22 13:47                                             ` Yuan Fu
@ 2021-07-22 14:11                                               ` Óscar Fuentes
  2021-07-22 17:09                                                 ` Eli Zaretskii
  2021-07-22 17:00                                               ` Eli Zaretskii
  2021-07-24  9:33                                               ` Stephen Leake
  2 siblings, 1 reply; 284+ messages in thread
From: Óscar Fuentes @ 2021-07-22 14:11 UTC (permalink / raw)
  To: emacs-devel

Yuan Fu <casouri@gmail.com> writes:

> That leads to another point. I suspect the memory limit will come
> before the speed limit, i.e., as the file size increases, the memory
> consumption will become unacceptable before the speed does. So it is
> possible that we want to outright disable tree-sitter for larger
> files, then we don’t need to do much to improve the responsiveness of
> tree-sitter on large files. And we might want to delete the parse tree
> if a buffer has been idle for a while. Of course, that’s just my
> superstition, we’ll see once we can measure the performance.

Of course those parameters would be configurable on Emacs, but disabling
TS on a 2MB file because it uses 20MB is way too conservative, IMHO.

Nowadays the cheapest netbook comes with at least 1GB RAM and can do
memory-to-memory copies at a rate of GB/s.

Guys, you are speculating too much about minutia and worst-case
scenarios. (Do we really care about TS not supporting files larger than
4GB? I mean, REALLY?)

I'll rather focus on implementing the thing and optimize later. My bet
is that a crude implementation would work fine for the 99% of the users
and be an improvement over what we have now on practically all cases.

BTW, a 10x AST/source-code size ratio is quite reasonable.




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: [SPAM UNSURE] Re: How to add pseudo vector types
  2021-07-22  6:44                                                   ` Eli Zaretskii
@ 2021-07-22 14:43                                                     ` Clément Pit-Claudel
  0 siblings, 0 replies; 284+ messages in thread
From: Clément Pit-Claudel @ 2021-07-22 14:43 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

On 7/22/21 2:44 AM, Eli Zaretskii wrote:
> We can do that as well.  It's again something very similar to the
> undo-list info we already collect.

Yes, the last discussion about this didn't end too well :'( And I haven't had time to work on it.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-22 13:47                                             ` Yuan Fu
  2021-07-22 14:11                                               ` Óscar Fuentes
@ 2021-07-22 17:00                                               ` Eli Zaretskii
  2021-07-22 17:47                                                 ` Yuan Fu
                                                                   ` (2 more replies)
  2021-07-24  9:33                                               ` Stephen Leake
  2 siblings, 3 replies; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-22 17:00 UTC (permalink / raw)
  To: Yuan Fu; +Cc: cpitclaudel, monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Thu, 22 Jul 2021 09:47:45 -0400
> Cc: Stefan Monnier <monnier@iro.umontreal.ca>,
>  Clément Pit-Claudel <cpitclaudel@gmail.com>,
>  emacs-devel@gnu.org
> 
> Yes, I meant to discuss this. The problem with respecting narrowing is that, a user can freely narrow and widen arbitrarily, and Emacs needs to translate them into insertion & deletion of the buffer text for tree-sitter, every time a user narrows or widens the buffer. Plus, if tree-sitter respects narrowing, it could happen where a user narrows the buffer, the font-locking changes and is not correct anymore. Maybe that’s not the user want. Also, if someone narrows and widens often, maybe narrow to a function for better focus, tree-sitter needs to constantly re-parse most of the buffer. These are not significant disadvantages, but what do we get from respecting narrowing that justifies code complexity and these small annoyances?

But that's how the current font-lock and indentation work: they never
look beyond the narrowing limits.  So why should the TS-based features
behave differently?

As for temporary narrowing: if we record the changes, but don't send
them to TS until we actually need re-parsing, then we could eliminate
the temporary narrowing when we report the changes to TS, leaving only
the narrowing that exists at the time of the re-parse.  At least for
fontifications, that time is redisplay time, and users do expect to
see the text fontified according to the current narrowing.

> >>   *bytes_read = (uint32_t) len;
> > 
> > Is using uint32_t the restriction of tree-sitter?  Doesn't it support
> > reading more than 2 gigabytes?
> 
> I’m not sure why it asks for uint32 specifically, but that’s what it asks for its api. I don’t think you are supposed to use tree-sitter on files of size of gigabytes, because the author mentioned that tree-sitter uses over 10x as much memory as the size of the source file [1]. On files larger than a couple of megabytes, I think we better turn off tree-sitter. Normally those files are not regular source files, anyway, and we don’t need a parse tree for a log.

I don't necessarily agree with the "not regular source files" part.
For example, JSON files can be quite large.  And there are also log
files, which are even larger -- did no one adapt TS to fontifying
those yet?

More generally: is the problem real?  If you make a file that is 1000
copies of xdisp.c, and then submit it to TS, do you really get 10GB of
memory consumption?  This is something that is good to know up front,
so we'd know what to expect down the road.

> That leads to another point. I suspect the memory limit will come before the speed limit, i.e., as the file size increases, the memory consumption will become unacceptable before the speed does. So it is possible that we want to outright disable tree-sitter for larger files, then we don’t need to do much to improve the responsiveness of tree-sitter on large files. And we might want to delete the parse tree if a buffer has been idle for a while. Of course, that’s just my superstition, we’ll see once we can measure the performance.

See above: IMO, we should benchmark both the CPU and memory
performance of TS for such large files, before we decide on the course
of action.

> >> +DEFUN ("tree-sitter-node-type",
> >> +       Ftree_sitter_node_type, Stree_sitter_node_type, 1, 1, 0,
> >> +       doc: /* Return the NODE's type as a symbol.  */)
> >> +  (Lisp_Object node)
> >> +{
> >> +  CHECK_TS_NODE (node);
> >> +  TSNode ts_node = XTS_NODE (node)->node;
> >> +  const char *type = ts_node_type(ts_node);
> >> +  return intern_c_string (type);
> > 
> > Why do we need to intern the string each time? can't we store the
> > interned symbol there, instead of a C string, in the first place?
> 
> I’m not sure what do you mean by “store the interned symbol there”, where do I store the interned symbol?

In the struct that ts_node_type accesses, instead of the 'char *'
string you store there now.

> (BTW, If you see something wrong, that’s probably because I don’t know the right way to do it, and grepping only got me that far.)

Do what? feel free to ask questions when you aren't sure how to
accomplish something on the C level.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-22 14:11                                               ` Óscar Fuentes
@ 2021-07-22 17:09                                                 ` Eli Zaretskii
  2021-07-22 19:29                                                   ` Óscar Fuentes
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-22 17:09 UTC (permalink / raw)
  To: Óscar Fuentes; +Cc: emacs-devel

> From: Óscar Fuentes <ofv@wanadoo.es>
> Date: Thu, 22 Jul 2021 16:11:09 +0200
> 
> Yuan Fu <casouri@gmail.com> writes:
> 
> > That leads to another point. I suspect the memory limit will come
> > before the speed limit, i.e., as the file size increases, the memory
> > consumption will become unacceptable before the speed does. So it is
> > possible that we want to outright disable tree-sitter for larger
> > files, then we don’t need to do much to improve the responsiveness of
> > tree-sitter on large files. And we might want to delete the parse tree
> > if a buffer has been idle for a while. Of course, that’s just my
> > superstition, we’ll see once we can measure the performance.
> 
> Of course those parameters would be configurable on Emacs, but disabling
> TS on a 2MB file because it uses 20MB is way too conservative, IMHO.

Why would we limit ourselves to 20MB?  uint32_t supports upto 4GB.

> Guys, you are speculating too much about minutia and worst-case
> scenarios. (Do we really care about TS not supporting files larger than
> 4GB? I mean, REALLY?)

Yes, we do.  For at least 2 reasons: (a) source code files produced by
programs can be very large; (b) having a feature that fails before you
reach the max size of a buffer Emacs supports is a problem, because it
will cause hard-to-deal-with problems.

Or let me turn the table and ask why we cared to support the largest
possible buffer size when 32-bit systems were the rule?

> I'll rather focus on implementing the thing and optimize later. My bet
> is that a crude implementation would work fine for the 99% of the users
> and be an improvement over what we have now on practically all cases.

This is not a prototype project.  (Or at least I hope it won't end up
being that.)  This is supposed to be the industry-strength code that
core Emacs will use for the years to come to support features which
need language-dependent parsing.  It cannot work correctly only in 99%
of use cases.  So we must assess the limitations seriously and plan
ahead for them.

> BTW, a 10x AST/source-code size ratio is quite reasonable.

It could be, but please don't forget that this is _in_addition_to_ the
"normal" Emacs memory footprint, and that could easily be 1GB and
sometimes several times that.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-22 17:00                                               ` Eli Zaretskii
@ 2021-07-22 17:47                                                 ` Yuan Fu
  2021-07-22 19:05                                                   ` Eli Zaretskii
  2021-07-23 14:07                                                 ` Stefan Monnier
  2021-07-24  9:42                                                 ` Stephen Leake
  2 siblings, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-07-22 17:47 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Clément Pit-Claudel, Stefan Monnier, emacs-devel



> On Jul 22, 2021, at 1:00 PM, Eli Zaretskii <eliz@gnu.org> wrote:
> 
>> From: Yuan Fu <casouri@gmail.com>
>> Date: Thu, 22 Jul 2021 09:47:45 -0400
>> Cc: Stefan Monnier <monnier@iro.umontreal.ca>,
>> Clément Pit-Claudel <cpitclaudel@gmail.com>,
>> emacs-devel@gnu.org
>> 
>> Yes, I meant to discuss this. The problem with respecting narrowing is that, a user can freely narrow and widen arbitrarily, and Emacs needs to translate them into insertion & deletion of the buffer text for tree-sitter, every time a user narrows or widens the buffer. Plus, if tree-sitter respects narrowing, it could happen where a user narrows the buffer, the font-locking changes and is not correct anymore. Maybe that’s not the user want. Also, if someone narrows and widens often, maybe narrow to a function for better focus, tree-sitter needs to constantly re-parse most of the buffer. These are not significant disadvantages, but what do we get from respecting narrowing that justifies code complexity and these small annoyances?
> 
> But that's how the current font-lock and indentation work: they never
> look beyond the narrowing limits.  So why should the TS-based features
> behave differently?
> 
> As for temporary narrowing: if we record the changes, but don't send
> them to TS until we actually need re-parsing, then we could eliminate
> the temporary narrowing when we report the changes to TS, leaving only
> the narrowing that exists at the time of the re-parse.  At least for
> fontifications, that time is redisplay time, and users do expect to
> see the text fontified according to the current narrowing.



> 
>>>>  *bytes_read = (uint32_t) len;
>>> 
>>> Is using uint32_t the restriction of tree-sitter?  Doesn't it support
>>> reading more than 2 gigabytes?
>> 
>> I’m not sure why it asks for uint32 specifically, but that’s what it asks for its api. I don’t think you are supposed to use tree-sitter on files of size of gigabytes, because the author mentioned that tree-sitter uses over 10x as much memory as the size of the source file [1]. On files larger than a couple of megabytes, I think we better turn off tree-sitter. Normally those files are not regular source files, anyway, and we don’t need a parse tree for a log.
> 
> I don't necessarily agree with the "not regular source files" part.
> For example, JSON files can be quite large.  And there are also log
> files, which are even larger -- did no one adapt TS to fontifying
> those yet?

There is a JSON parser, but I don’t think there is one for log files.

> 
> More generally: is the problem real?  If you make a file that is 1000
> copies of xdisp.c, and then submit it to TS, do you really get 10GB of
> memory consumption?  This is something that is good to know up front,
> so we'd know what to expect down the road.

Yes. I concatenated 100 xdisp.c together, and parsed them with my simple C program. It used 1.8 G. I didn’t test for 1000 together, but I think the trend is linear.

time -l ./main-large-c
       16.48 real        15.32 user         0.81 sys
          1883959296  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
              459951  page reclaims
                  22  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   6  voluntary context switches
                1653  involuntary context switches
        107310143182  instructions retired
         58561420060  cycles elapsed
          1883095040  peak memory footprint

> 
>> That leads to another point. I suspect the memory limit will come before the speed limit, i.e., as the file size increases, the memory consumption will become unacceptable before the speed does. So it is possible that we want to outright disable tree-sitter for larger files, then we don’t need to do much to improve the responsiveness of tree-sitter on large files. And we might want to delete the parse tree if a buffer has been idle for a while. Of course, that’s just my superstition, we’ll see once we can measure the performance.
> 
> See above: IMO, we should benchmark both the CPU and memory
> performance of TS for such large files, before we decide on the course
> of action.

That’s my thought, too. I should have reserved my suspicion until I have benchmark measurements.

> 
>>>> +DEFUN ("tree-sitter-node-type",
>>>> +       Ftree_sitter_node_type, Stree_sitter_node_type, 1, 1, 0,
>>>> +       doc: /* Return the NODE's type as a symbol.  */)
>>>> +  (Lisp_Object node)
>>>> +{
>>>> +  CHECK_TS_NODE (node);
>>>> +  TSNode ts_node = XTS_NODE (node)->node;
>>>> +  const char *type = ts_node_type(ts_node);
>>>> +  return intern_c_string (type);
>>> 
>>> Why do we need to intern the string each time? can't we store the
>>> interned symbol there, instead of a C string, in the first place?
>> 
>> I’m not sure what do you mean by “store the interned symbol there”, where do I store the interned symbol?
> 
> In the struct that ts_node_type accesses, instead of the 'char *'
> string you store there now.

The struct that ts_node_type accesses is a TSNode, which is defined by tree-sitter. ts_node_type is an API provided by tree-sitter, I’m just exposing it to lisp. I could return strings instead of symbols, but I thought symbols might be more appropriate and more convenient for users of this function. 

>> (BTW, If you see something wrong, that’s probably because I don’t know the right way to do it, and grepping only got me that far.)
> 
> Do what? feel free to ask questions when you aren't sure how to
> accomplish something on the C level.

Thanks. Is below the correct way to set a buffer-local variable? (I’m setting tree-sitter-parser-list.)

struct buffer *old_buffer = current_buffer;
  set_buffer_internal (XBUFFER (buffer));

  Fset (Qtree_sitter_parser_list,
	Fcons (lisp_parser, Fsymbol_value (Qtree_sitter_parser_list)));

  set_buffer_internal (old_buffer);

Also, we don’t call change hooks in replace_range_2, why? Should I update tree-sitter trees in that function, or should I not?

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-22 17:47                                                 ` Yuan Fu
@ 2021-07-22 19:05                                                   ` Eli Zaretskii
  2021-07-23 13:25                                                     ` Yuan Fu
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-22 19:05 UTC (permalink / raw)
  To: Yuan Fu; +Cc: cpitclaudel, monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Thu, 22 Jul 2021 13:47:20 -0400
> Cc: Stefan Monnier <monnier@iro.umontreal.ca>,
>  Clément Pit-Claudel <cpitclaudel@gmail.com>,
>  emacs-devel@gnu.org
> 
> > More generally: is the problem real?  If you make a file that is 1000
> > copies of xdisp.c, and then submit it to TS, do you really get 10GB of
> > memory consumption?  This is something that is good to know up front,
> > so we'd know what to expect down the road.
> 
> Yes. I concatenated 100 xdisp.c together, and parsed them with my simple C program. It used 1.8 G. I didn’t test for 1000 together, but I think the trend is linear.

That's good to know, thanks.

So what does TS do if it attempts to allocate more memory and that
fails?  Regardless, we'd need some fallback strategy, because AFAIU
many people run with VM overcommit enabled, so the OOM killer will
just kill the Emacs process when it asks for too much memory.

> >>>> +DEFUN ("tree-sitter-node-type",
> >>>> +       Ftree_sitter_node_type, Stree_sitter_node_type, 1, 1, 0,
> >>>> +       doc: /* Return the NODE's type as a symbol.  */)
> >>>> +  (Lisp_Object node)
> >>>> +{
> >>>> +  CHECK_TS_NODE (node);
> >>>> +  TSNode ts_node = XTS_NODE (node)->node;
> >>>> +  const char *type = ts_node_type(ts_node);
> >>>> +  return intern_c_string (type);
> >>> 
> >>> Why do we need to intern the string each time? can't we store the
> >>> interned symbol there, instead of a C string, in the first place?
> >> 
> >> I’m not sure what do you mean by “store the interned symbol there”, where do I store the interned symbol?
> > 
> > In the struct that ts_node_type accesses, instead of the 'char *'
> > string you store there now.
> 
> The struct that ts_node_type accesses is a TSNode, which is defined by tree-sitter. ts_node_type is an API provided by tree-sitter, I’m just exposing it to lisp. I could return strings instead of symbols, but I thought symbols might be more appropriate and more convenient for users of this function. 

Maybe there's a better way of exposing that to Lisp.  But that's a
minor point, it can be left for later.

> Is below the correct way to set a buffer-local variable? (I’m setting tree-sitter-parser-list.)
> 
> struct buffer *old_buffer = current_buffer;
>   set_buffer_internal (XBUFFER (buffer));
> 
>   Fset (Qtree_sitter_parser_list,
> 	Fcons (lisp_parser, Fsymbol_value (Qtree_sitter_parser_list)));
> 
>   set_buffer_internal (old_buffer);

Yes, but it would be better to use DEFVAR_LISP and then you could
assign directly to Vtree_sitter_parser_list, instead of using Fset.

> Also, we don’t call change hooks in replace_range_2, why?

Because it is called in a loop, one character at a time.  The caller
of replace_range_2 calls these hooks for the entire region, once.

> Should I update tree-sitter trees in that function, or should I not?

The only caller is casify_region, so you could update there.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-22 17:09                                                 ` Eli Zaretskii
@ 2021-07-22 19:29                                                   ` Óscar Fuentes
  2021-07-23  5:21                                                     ` Eli Zaretskii
  2021-07-24  9:38                                                     ` Stephen Leake
  0 siblings, 2 replies; 284+ messages in thread
From: Óscar Fuentes @ 2021-07-22 19:29 UTC (permalink / raw)
  To: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> Of course those parameters would be configurable on Emacs, but disabling
>> TS on a 2MB file because it uses 20MB is way too conservative, IMHO.
>
> Why would we limit ourselves to 20MB?  uint32_t supports upto 4GB.

I didn't suggest that we should limit ourselves to 20MB, I observed that
current machines have enough resources for handling large files ("large"
meaning "big enough to keep me busy reading for some years.")

>> Guys, you are speculating too much about minutia and worst-case
>> scenarios. (Do we really care about TS not supporting files larger than
>> 4GB? I mean, REALLY?)
>
> Yes, we do.  For at least 2 reasons: (a) source code files produced by
> programs can be very large;

I know, I work with machine-generated (read: code-dense) 20+MB C++ files
on a regular basis.

However, I wouldn't agree on renouncing to useful features because they
could be problematic when dealing with large files. That is, it would be
a mistake to discard TS as inadequate for Emacs just because it doesn't
benefit (and I say "not benefit", not "penalise") certain use cases.

> (b) having a feature that fails before you
> reach the max size of a buffer Emacs supports is a problem, because it
> will cause hard-to-deal-with problems.

We can put reasonable limits on when to use TS once we have some
experience with it. What matters right now is if TS would be usable for
the typical use case, and I guess the answer is positive. Also, it is
not as if we had other options to consider.

>> I'll rather focus on implementing the thing and optimize later. My bet
>> is that a crude implementation would work fine for the 99% of the users
>> and be an improvement over what we have now on practically all cases.
>
> This is not a prototype project.  (Or at least I hope it won't end up
> being that.)  This is supposed to be the industry-strength code that
> core Emacs will use for the years to come to support features which
> need language-dependent parsing.  It cannot work correctly only in 99%
> of use cases.  So we must assess the limitations seriously and plan
> ahead for them.

I said "would work *fine* for the 99% of users", this does not imply
that it would work incorrectly for the rest.

On the "planning ahead" part, TS support would be an optional,
quasi-external feature for some time, it is not as if it comes out with
some critical bug Emacs would become unusable. TS support can be
fine-tuned without disrupting the rest of Emacs development. If, on the
other hand, we start making changes on Emacs' internals for allowing
some TS-related optimizations (even when we don't know if they are
neccessary at all) that could be a destabilizing move for Emacs as a
whole. Apart from delaying TS support.

>> BTW, a 10x AST/source-code size ratio is quite reasonable.
>
> It could be, but please don't forget that this is _in_addition_to_ the
> "normal" Emacs memory footprint, and that could easily be 1GB and
> sometimes several times that.

Yes, but if you want something you need to pay something, and you can
hardly get TS' features with less than that. At least for complex
languages like C++.

Talking about scenarios of heavy memory usage, I'll comment in passing
that in my recent experience, once Emacs exceeds 2GB the gc pauses start
to be so annoying that I don't care anymore about how much memory an
external tool would use if it works fast enough.




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-22 19:29                                                   ` Óscar Fuentes
@ 2021-07-23  5:21                                                     ` Eli Zaretskii
  2021-07-24  9:38                                                     ` Stephen Leake
  1 sibling, 0 replies; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-23  5:21 UTC (permalink / raw)
  To: Óscar Fuentes; +Cc: emacs-devel

> From: Óscar Fuentes <ofv@wanadoo.es>
> Date: Thu, 22 Jul 2021 21:29:02 +0200
> 
> >> Guys, you are speculating too much about minutia and worst-case
> >> scenarios. (Do we really care about TS not supporting files larger than
> >> 4GB? I mean, REALLY?)
> >
> > Yes, we do.  For at least 2 reasons: (a) source code files produced by
> > programs can be very large;
> 
> I know, I work with machine-generated (read: code-dense) 20+MB C++ files
> on a regular basis.
> 
> However, I wouldn't agree on renouncing to useful features because they
> could be problematic when dealing with large files. That is, it would be
> a mistake to discard TS as inadequate for Emacs just because it doesn't
> benefit (and I say "not benefit", not "penalise") certain use cases.

It was not my intent to say we should discard TS as inadequate because
of these limitations.  What I meant is that we should know about the
limitations and plan in advance how to handle them when a user bumps
into them.  Disabling TS-related features could be one such
mitigation, but maybe we could come up with smarter fallbacks.

It sounds like the rest of you message was to convince me not to give
up on TS, in which case there's no need: I'm convinced already, and
mostly agree with what you say.

> Talking about scenarios of heavy memory usage, I'll comment in passing
> that in my recent experience, once Emacs exceeds 2GB the gc pauses start
> to be so annoying that I don't care anymore about how much memory an
> external tool would use if it works fast enough.

That's a separate issue.  And the amount of memory GC has to scan is
not directly related to the memory footprint of the Emacs process.  So
I would be interested in seeing the results of memory-report in those
cases where GC takes too long (in a separate thread, please).



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-22 19:05                                                   ` Eli Zaretskii
@ 2021-07-23 13:25                                                     ` Yuan Fu
  2021-07-23 19:10                                                       ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-07-23 13:25 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Clément Pit-Claudel, Stefan Monnier, emacs-devel



> On Jul 22, 2021, at 3:05 PM, Eli Zaretskii <eliz@gnu.org> wrote:
> 
>> From: Yuan Fu <casouri@gmail.com>
>> Date: Thu, 22 Jul 2021 13:47:20 -0400
>> Cc: Stefan Monnier <monnier@iro.umontreal.ca>,
>> Clément Pit-Claudel <cpitclaudel@gmail.com>,
>> emacs-devel@gnu.org
>> 
>>> More generally: is the problem real?  If you make a file that is 1000
>>> copies of xdisp.c, and then submit it to TS, do you really get 10GB of
>>> memory consumption?  This is something that is good to know up front,
>>> so we'd know what to expect down the road.
>> 
>> Yes. I concatenated 100 xdisp.c together, and parsed them with my simple C program. It used 1.8 G. I didn’t test for 1000 together, but I think the trend is linear.
> 
> That's good to know, thanks.
> 
> So what does TS do if it attempts to allocate more memory and that
> fails?  Regardless, we'd need some fallback strategy, because AFAIU
> many people run with VM overcommit enabled, so the OOM killer will
> just kill the Emacs process when it asks for too much memory.

Abort, it seems:

static inline void *ts_malloc_default(size_t size) {
  void *result = malloc(size);
  if (size > 0 && !result) {
    fprintf(stderr, "tree-sitter failed to allocate %zu bytes", size);
    exit(1);
  }
  return result;
}

>> Also, we don’t call change hooks in replace_range_2, why?
> 
> Because it is called in a loop, one character at a time.  The caller
> of replace_range_2 calls these hooks for the entire region, once.
> 
>> Should I update tree-sitter trees in that function, or should I not?
> 
> The only caller is casify_region, so you could update there.

casify_region doesn’t have access to byte positions. I’ll leave it as-is, recording change in replace_range_2, if you don’t object to it.

Yuan




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-22 17:00                                               ` Eli Zaretskii
  2021-07-22 17:47                                                 ` Yuan Fu
@ 2021-07-23 14:07                                                 ` Stefan Monnier
  2021-07-23 14:45                                                   ` Yuan Fu
  2021-07-23 19:13                                                   ` Eli Zaretskii
  2021-07-24  9:42                                                 ` Stephen Leake
  2 siblings, 2 replies; 284+ messages in thread
From: Stefan Monnier @ 2021-07-23 14:07 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Yuan Fu, cpitclaudel, emacs-devel

> But that's how the current font-lock and indentation work: they never
> look beyond the narrowing limits.

Not quite: that's true for indentation, but for font-lock we have
`font-lock-dont-widen` (i.e. by default, font-lock widens temporarily
while it does its job).

For TS, given the cost associated with changing the bounds, I think it
would make a lot of sense to ignore narrowing (and maybe provide some
separate way to specify bounds, for the rare cases like Info and Rmail
where a buffer contains "a collection of things" and we only want to
parse/manipulate one of those things at any given time).


        Stefan




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-23 14:07                                                 ` Stefan Monnier
@ 2021-07-23 14:45                                                   ` Yuan Fu
  2021-07-23 19:13                                                   ` Eli Zaretskii
  1 sibling, 0 replies; 284+ messages in thread
From: Yuan Fu @ 2021-07-23 14:45 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Eli Zaretskii, Clément Pit-Claudel, emacs-devel



> On Jul 23, 2021, at 10:07 AM, Stefan Monnier <monnier@iro.umontreal.ca> wrote:
> 
>> But that's how the current font-lock and indentation work: they never
>> look beyond the narrowing limits.
> 
> Not quite: that's true for indentation, but for font-lock we have
> `font-lock-dont-widen` (i.e. by default, font-lock widens temporarily
> while it does its job).
> 
> For TS, given the cost associated with changing the bounds, I think it
> would make a lot of sense to ignore narrowing (and maybe provide some
> separate way to specify bounds, for the rare cases like Info and Rmail
> where a buffer contains "a collection of things" and we only want to
> parse/manipulate one of those things at any given time).

Tree-sitter lets you set ranges in which the parser works in. That’s how they support multi-language files like html+javascript+css. This will certainly work for Rmail and Info, too.

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-23 13:25                                                     ` Yuan Fu
@ 2021-07-23 19:10                                                       ` Eli Zaretskii
  2021-07-23 20:01                                                         ` Perry E. Metzger
  2021-07-23 20:22                                                         ` Yuan Fu
  0 siblings, 2 replies; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-23 19:10 UTC (permalink / raw)
  To: Yuan Fu; +Cc: cpitclaudel, monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Fri, 23 Jul 2021 09:25:17 -0400
> Cc: Stefan Monnier <monnier@iro.umontreal.ca>,
>  Clément Pit-Claudel <cpitclaudel@gmail.com>,
>  emacs-devel@gnu.org
> 
> > So what does TS do if it attempts to allocate more memory and that
> > fails?  Regardless, we'd need some fallback strategy, because AFAIU
> > many people run with VM overcommit enabled, so the OOM killer will
> > just kill the Emacs process when it asks for too much memory.
> 
> Abort, it seems:
> 
> static inline void *ts_malloc_default(size_t size) {
>   void *result = malloc(size);
>   if (size > 0 && !result) {
>     fprintf(stderr, "tree-sitter failed to allocate %zu bytes", size);
>     exit(1);
>   }
>   return result;
> }

We must replace this function, if only because the MS-Windows build of
Emacs uses a custom malloc implementation.  Does TS allow the client
to use its own malloc?

> >> Also, we don’t call change hooks in replace_range_2, why?
> > 
> > Because it is called in a loop, one character at a time.  The caller
> > of replace_range_2 calls these hooks for the entire region, once.
> > 
> >> Should I update tree-sitter trees in that function, or should I not?
> > 
> > The only caller is casify_region, so you could update there.
> 
> casify_region doesn’t have access to byte positions.

You can compute them using CHAR_TO_BYTE.

> I’ll leave it as-is, recording change in replace_range_2, if you don’t object to it.

That'd be wasteful, I think.  replace_range_2 is called one character
at a time.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-23 14:07                                                 ` Stefan Monnier
  2021-07-23 14:45                                                   ` Yuan Fu
@ 2021-07-23 19:13                                                   ` Eli Zaretskii
  2021-07-23 20:28                                                     ` Stefan Monnier
  1 sibling, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-23 19:13 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: casouri, cpitclaudel, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: Yuan Fu <casouri@gmail.com>,  cpitclaudel@gmail.com,  emacs-devel@gnu.org
> Date: Fri, 23 Jul 2021 10:07:42 -0400
> 
> > But that's how the current font-lock and indentation work: they never
> > look beyond the narrowing limits.
> 
> Not quite: that's true for indentation, but for font-lock we have
> `font-lock-dont-widen` (i.e. by default, font-lock widens temporarily
> while it does its job).

jit-lock never requests fontifications outside of the accessible
portion, because redisplay doesn't look there.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-23 19:10                                                       ` Eli Zaretskii
@ 2021-07-23 20:01                                                         ` Perry E. Metzger
  2021-07-24  5:52                                                           ` Eli Zaretskii
  2021-07-23 20:22                                                         ` Yuan Fu
  1 sibling, 1 reply; 284+ messages in thread
From: Perry E. Metzger @ 2021-07-23 20:01 UTC (permalink / raw)
  To: emacs-devel

On 7/23/21 15:10, Eli Zaretskii wrote:

>> Abort, it seems:
>> static inline void *ts_malloc_default(size_t size) {
>>    void *result = malloc(size);
>>    if (size > 0 && !result) {
>>      fprintf(stderr, "tree-sitter failed to allocate %zu bytes", size);
>>      exit(1);
>>    }
>>    return result;
>> }
> We must replace this function, if only because the MS-Windows build of
> Emacs uses a custom malloc implementation.  Does TS allow the client
> to use its own malloc?

Certainly more graceful allocation error behavior would be necessary in 
an Emacs context even on Unix-like operating systems. An unexpected hard 
exit could result in loss of data for the user.

Perry





^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-23 19:10                                                       ` Eli Zaretskii
  2021-07-23 20:01                                                         ` Perry E. Metzger
@ 2021-07-23 20:22                                                         ` Yuan Fu
  2021-07-24  6:00                                                           ` Eli Zaretskii
  2021-07-24 15:04                                                           ` Yuan Fu
  1 sibling, 2 replies; 284+ messages in thread
From: Yuan Fu @ 2021-07-23 20:22 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Clément Pit-Claudel, Stefan Monnier, emacs-devel

> 
> We must replace this function, if only because the MS-Windows build of
> Emacs uses a custom malloc implementation.  Does TS allow the client
> to use its own malloc?

Yes, in that case, we need to embed tree-sitter into Emacs, instead of using it as a dynamic library, I think.

// Allow clients to override allocation functions
#ifndef ts_malloc
#define ts_malloc  ts_malloc_default
#endif
#ifndef ts_calloc
#define ts_calloc  ts_calloc_default
#endif
#ifndef ts_realloc
#define ts_realloc ts_realloc_default
#endif
#ifndef ts_free
#define ts_free    ts_free_default
#endif

How do we handle such thing in Emacs?

> 
>>>> Also, we don’t call change hooks in replace_range_2, why?
>>> 
>>> Because it is called in a loop, one character at a time.  The caller
>>> of replace_range_2 calls these hooks for the entire region, once.
>>> 
>>>> Should I update tree-sitter trees in that function, or should I not?
>>> 
>>> The only caller is casify_region, so you could update there.
>> 
>> casify_region doesn’t have access to byte positions.
> 
> You can compute them using CHAR_TO_BYTE.

Ok. I’ll do that.

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-23 19:13                                                   ` Eli Zaretskii
@ 2021-07-23 20:28                                                     ` Stefan Monnier
  2021-07-24  6:02                                                       ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Stefan Monnier @ 2021-07-23 20:28 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: casouri, cpitclaudel, emacs-devel

>> > But that's how the current font-lock and indentation work: they never
>> > look beyond the narrowing limits.
>> Not quite: that's true for indentation, but for font-lock we have
>> `font-lock-dont-widen` (i.e. by default, font-lock widens temporarily
>> while it does its job).
> jit-lock never requests fontifications outside of the accessible
> portion, because redisplay doesn't look there.

But font-lock may look (and fontify) beyond the narrowing, and
when it calls `syntax-ppss` it will usually parse from 1 rather than
from `point-min`.

I'd expect jit/font-lock running on top of TS to behave similarly: the
actual parsing is done over the widened buffer but the fontification is
only applied to the visible part (or nearby).


        Stefan




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-21 19:37                             ` Eli Zaretskii
@ 2021-07-24  2:00                               ` Stephen Leake
  2021-07-24  6:51                                 ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Stephen Leake @ 2021-07-24  2:00 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: casouri, monnier, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Stephen Leake <stephen_leake@stephe-leake.org>
>> Cc: casouri@gmail.com,  monnier@iro.umontreal.ca,  emacs-devel@gnu.org
>> Date: Wed, 21 Jul 2021 08:49:15 -0700
>> 
>> > I fail to see the significance of the difference.  Surely, you could
>> > hand it a block of text with changes to mean that this block replaces
>> > the previous version of that block.  It might take the parser more
>> > work to update the parse tree in this case, but if it's fast enough,
>> > that won't be the problem.  Right?
>> 
>> tree-sitter doesn't store the previous text, so there's nothing to
>> compare it to.
>
> There was nothing about comparison in my text.  You tell TS that
> editing replaced a block of text between A and B with block between A
> and C, without revealing the fine-grained changes inside that block.
> This must work, because editing could indeed do just that.

I see; treat the whole block as one change. Yes, that would work, but it
would probably be less optimal than sending a list of smaller changes;
depends on the details.

-- 
-- Stephe



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-21 19:43                                             ` Eli Zaretskii
@ 2021-07-24  2:57                                               ` Stephen Leake
  2021-07-24  3:39                                                 ` Óscar Fuentes
  2021-07-24  7:06                                                 ` Eli Zaretskii
  2021-07-24  3:55                                               ` Clément Pit-Claudel
  1 sibling, 2 replies; 284+ messages in thread
From: Stephen Leake @ 2021-07-24  2:57 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Clément Pit-Claudel, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Clément Pit-Claudel <cpitclaudel@gmail.com>
>> Date: Wed, 21 Jul 2021 12:54:16 -0400
>>
>> On 7/21/21 12:29 PM, Stephen Leake wrote:
>> > Yes, for both tree-sitter and wisi. wisi can take even longer if lots of
>> > error correction is required (I have a time-out set at 5 seconds). But
>> > that happens when the file is first opened; I doubt any user would start
>> > typing that fast. I know I typically take a while to just look at the
>> > text, and then navigate to the point of interest.
>>
>> I'm not sure.  We've had significant complaint in Flycheck for freezing Emacs for <1s
>
> How much "less"?  Close to 1 sec is indeed annoying, but 20 msec or so
> should be bearable.
>
> You seem to assume up front that TS (re)-parsing will take 1 sec, but
> AFAIK there's no reason to assume such bad performance.

This is for the initial parse, on a large file. No matter how fast the
parser is, I can give you a file that takes one second to parse, and
some user will have such a file (the work always expands to consume all
the resources available).

I just got incremental parse working well enough to measure it; in the
largest Ada file I have (10,000 lines from Eurocontrol):

initial parse:       1.539319 seconds
re-indent two lines: 0.038999 seconds

39 milliseconds for re-indent is just slow enough to be noticeable; I still
have algorithms to convert to be as incremental as possible.

The initial parse includes sending the full file text to the external
process over a pipe. Parsing that same large file with the command-line
parser (no emacs involved; file is memory-mapped) takes only 0.190
seconds, so there is lots of room for optimization - moving to a module
with direct access to the emacs buffer should do a lot.

In a very small file:

initial   0.000632 seconds
re-indent 0.000942 seconds

Easily fast enough to keep up with the user.

I don't have a direct comparison of tree-sitter and wisi parsing the
same file; I'll have to see if I can set that up.

--
-- Stephe



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-24  2:57                                               ` Stephen Leake
@ 2021-07-24  3:39                                                 ` Óscar Fuentes
  2021-07-24  7:34                                                   ` Eli Zaretskii
  2021-07-25 16:49                                                   ` Stephen Leake
  2021-07-24  7:06                                                 ` Eli Zaretskii
  1 sibling, 2 replies; 284+ messages in thread
From: Óscar Fuentes @ 2021-07-24  3:39 UTC (permalink / raw)
  To: emacs-devel

Stephen Leake <stephen_leake@stephe-leake.org> writes:

> 39 milliseconds for re-indent is just slow enough to be noticeable; I still
> have algorithms to convert to be as incremental as possible.

[snip]

> In a very small file:
>
> initial   0.000632 seconds
> re-indent 0.000942 seconds
>
> Easily fast enough to keep up with the user.

Doing work every time the user changes the file is not always a good
thing. Nowadays the user doesn't just expect automatic indentation, he
wants code formatting too, which means splitting, fusing and inserting
lines, plus moving chunks of code left and right. Doing that every time
a character is added or deleted can be visually confusing due to chunks
of text changing positions as you type, so the systems I know are
triggered by certain events (like the insertion of characters that mark
the end of statements). Then they analyze the code and, if it is well
formed, apply the reformatting. Something similar could be said about
fontification and other tasks.

In my experience, delays of 0.1 seconds are perfectly acceptable with
this method.

So I'll insist on not obsessing too much about performance. Implement
something simple, see if it is usable. If not, invest effort on
optimizations until it is good enough.




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-21 19:43                                             ` Eli Zaretskii
  2021-07-24  2:57                                               ` Stephen Leake
@ 2021-07-24  3:55                                               ` Clément Pit-Claudel
  1 sibling, 0 replies; 284+ messages in thread
From: Clément Pit-Claudel @ 2021-07-24  3:55 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

On 7/21/21 3:43 PM, Eli Zaretskii wrote:
>> From: Clément Pit-Claudel <cpitclaudel@gmail.com>
>> I'm not sure.  We've had significant complaint in Flycheck for freezing Emacs for <1s
> 
> How much "less"?  Close to 1 sec is indeed annoying, but 20 msec or so
> should be bearable.

Indeed, for us the freeze is only in when the buffer is first open, so 20ms is fine; the cases we had complains about where close to 1s, maybe .8s (and in some cases significantly more, too).

> You seem to assume up front that TS (re)-parsing will take 1 sec, but
> AFAIK there's no reason to assume such bad performance.

I expect/hope re-parsing will be much faster.  For the initial parse, I was going from numbers that were given earlier in this thread.




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-23 20:01                                                         ` Perry E. Metzger
@ 2021-07-24  5:52                                                           ` Eli Zaretskii
  0 siblings, 0 replies; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-24  5:52 UTC (permalink / raw)
  To: Perry E. Metzger; +Cc: emacs-devel

> Date: Fri, 23 Jul 2021 16:01:14 -0400
> From: "Perry E. Metzger" <perry@piermont.com>
> 
> On 7/23/21 15:10, Eli Zaretskii wrote:
> 
> >> Abort, it seems:
> >> static inline void *ts_malloc_default(size_t size) {
> >>    void *result = malloc(size);
> >>    if (size > 0 && !result) {
> >>      fprintf(stderr, "tree-sitter failed to allocate %zu bytes", size);
> >>      exit(1);
> >>    }
> >>    return result;
> >> }
> > We must replace this function, if only because the MS-Windows build of
> > Emacs uses a custom malloc implementation.  Does TS allow the client
> > to use its own malloc?
> 
> Certainly more graceful allocation error behavior would be necessary in 
> an Emacs context even on Unix-like operating systems. An unexpected hard 
> exit could result in loss of data for the user.

Sure, which is why this must be replaced.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-23 20:22                                                         ` Yuan Fu
@ 2021-07-24  6:00                                                           ` Eli Zaretskii
  2021-07-25 18:01                                                             ` Stephen Leake
  2021-07-24 15:04                                                           ` Yuan Fu
  1 sibling, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-24  6:00 UTC (permalink / raw)
  To: Yuan Fu; +Cc: cpitclaudel, monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Fri, 23 Jul 2021 16:22:59 -0400
> Cc: Stefan Monnier <monnier@iro.umontreal.ca>,
>  Clément Pit-Claudel <cpitclaudel@gmail.com>,
>  emacs-devel@gnu.org
> 
> > We must replace this function, if only because the MS-Windows build of
> > Emacs uses a custom malloc implementation.  Does TS allow the client
> > to use its own malloc?
> 
> Yes, in that case, we need to embed tree-sitter into Emacs, instead of using it as a dynamic library, I think.
> 
> // Allow clients to override allocation functions
> #ifndef ts_malloc
> #define ts_malloc  ts_malloc_default
> #endif
> #ifndef ts_calloc
> #define ts_calloc  ts_calloc_default
> #endif
> #ifndef ts_realloc
> #define ts_realloc ts_realloc_default
> #endif
> #ifndef ts_free
> #define ts_free    ts_free_default
> #endif
> 
> How do we handle such thing in Emacs?

We use xmalloc, which calls memory_full when allocation fails, which
releases some spare memory we have for this purpose, and tells the
user to save the session and exit.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-23 20:28                                                     ` Stefan Monnier
@ 2021-07-24  6:02                                                       ` Eli Zaretskii
  2021-07-24 14:19                                                         ` Stefan Monnier
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-24  6:02 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: casouri, cpitclaudel, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: casouri@gmail.com,  cpitclaudel@gmail.com,  emacs-devel@gnu.org
> Date: Fri, 23 Jul 2021 16:28:16 -0400
> 
> >> > But that's how the current font-lock and indentation work: they never
> >> > look beyond the narrowing limits.
> >> Not quite: that's true for indentation, but for font-lock we have
> >> `font-lock-dont-widen` (i.e. by default, font-lock widens temporarily
> >> while it does its job).
> > jit-lock never requests fontifications outside of the accessible
> > portion, because redisplay doesn't look there.
> 
> But font-lock may look (and fontify) beyond the narrowing, and
> when it calls `syntax-ppss` it will usually parse from 1 rather than
> from `point-min`.

Yes, and that's why I said that callers should call 'widen' if they
need to do so.

The question is what should TS reading do _by_default_.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-24  2:00                               ` Stephen Leake
@ 2021-07-24  6:51                                 ` Eli Zaretskii
  2021-07-25 16:16                                   ` Stephen Leake
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-24  6:51 UTC (permalink / raw)
  To: Stephen Leake; +Cc: casouri, monnier, emacs-devel

> From: Stephen Leake <stephen_leake@stephe-leake.org>
> Cc: casouri@gmail.com,  monnier@iro.umontreal.ca,  emacs-devel@gnu.org
> Date: Fri, 23 Jul 2021 19:00:12 -0700
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> >> > I fail to see the significance of the difference.  Surely, you could
> >> > hand it a block of text with changes to mean that this block replaces
> >> > the previous version of that block.  It might take the parser more
> >> > work to update the parse tree in this case, but if it's fast enough,
> >> > that won't be the problem.  Right?
> >> 
> >> tree-sitter doesn't store the previous text, so there's nothing to
> >> compare it to.
> >
> > There was nothing about comparison in my text.  You tell TS that
> > editing replaced a block of text between A and B with block between A
> > and C, without revealing the fine-grained changes inside that block.
> > This must work, because editing could indeed do just that.
> 
> I see; treat the whole block as one change. Yes, that would work, but it
> would probably be less optimal than sending a list of smaller changes;
> depends on the details.

Since TS is very fast, I think this sub-optimality will not cause any
tangible performance issues in Emacs.  And from our POV it is a good
optimization because it will minimize (and to some extent optimize)
the traffic between Emacs and TS.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-24  2:57                                               ` Stephen Leake
  2021-07-24  3:39                                                 ` Óscar Fuentes
@ 2021-07-24  7:06                                                 ` Eli Zaretskii
  2021-07-25 17:48                                                   ` Stephen Leake
  1 sibling, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-24  7:06 UTC (permalink / raw)
  To: Stephen Leake; +Cc: cpitclaudel, emacs-devel

> From: Stephen Leake <stephen_leake@stephe-leake.org>
> Cc: Clément Pit-Claudel <cpitclaudel@gmail.com>,
>   emacs-devel@gnu.org
> Date: Fri, 23 Jul 2021 19:57:32 -0700
> 
> > How much "less"?  Close to 1 sec is indeed annoying, but 20 msec or so
> > should be bearable.
> >
> > You seem to assume up front that TS (re)-parsing will take 1 sec, but
> > AFAIK there's no reason to assume such bad performance.
> 
> This is for the initial parse, on a large file. No matter how fast the
> parser is, I can give you a file that takes one second to parse, and
> some user will have such a file (the work always expands to consume all
> the resources available).

That problem is already with us: if I visit xdisp.c in an unoptimized
build of Emacs 28, I wait almost 4 sec for the first window-full to be
displayed.  (It's more like 0.5 sec in an optimized build of Emacs
27.2.)  So the real question is how much using TS will _improve_ the
situation.

> I just got incremental parse working well enough to measure it; in the
> largest Ada file I have (10,000 lines from Eurocontrol):
> 
> initial parse:       1.539319 seconds
> re-indent two lines: 0.038999 seconds
> 
> 39 milliseconds for re-indent is just slow enough to be noticeable; I still
> have algorithms to convert to be as incremental as possible.

For comparison, how much does re-indentation of 2 lines take in Emacs
without a parser?

39 msec might be noticeable, but it isn't annoying; anything below 50
msec isn't.  Try "C-x TAB" in Emacs on 10-line block of text, and you
get more than that.  So if you consider that time a problem, it is
here already as well.

> The initial parse includes sending the full file text to the external
> process over a pipe.

So the above results are with wisi.  We need timings with TS to see
the results that really matter for this discussion.

> I don't have a direct comparison of tree-sitter and wisi parsing the
> same file; I'll have to see if I can set that up.

Please do.  Otherwise we are comparing apples with oranges.  They are
all fruit, but still...

Thanks.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-24  3:39                                                 ` Óscar Fuentes
@ 2021-07-24  7:34                                                   ` Eli Zaretskii
  2021-07-25 16:49                                                   ` Stephen Leake
  1 sibling, 0 replies; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-24  7:34 UTC (permalink / raw)
  To: Óscar Fuentes; +Cc: emacs-devel

> From: Óscar Fuentes <ofv@wanadoo.es>
> Date: Sat, 24 Jul 2021 05:39:09 +0200
> 
> So I'll insist on not obsessing too much about performance. Implement
> something simple, see if it is usable. If not, invest effort on
> optimizations until it is good enough.

Agreed.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-22 13:47                                             ` Yuan Fu
  2021-07-22 14:11                                               ` Óscar Fuentes
  2021-07-22 17:00                                               ` Eli Zaretskii
@ 2021-07-24  9:33                                               ` Stephen Leake
  2021-07-24 22:54                                                 ` Dmitry Gutov
  2 siblings, 1 reply; 284+ messages in thread
From: Stephen Leake @ 2021-07-24  9:33 UTC (permalink / raw)
  To: Yuan Fu
  Cc: Eli Zaretskii, emacs-devel, Clément Pit-Claudel, Stefan Monnier

Yuan Fu <casouri@gmail.com> writes:

>> 
>>> +  // TODO BUF_ZV_BYTE?
>> 
>> Do you want to discuss this?  I'd prefer to have it the other way
>> around: use BUF_ZV_BYTE by default.  The callers could widen the
>> buffer if they needed to access outside of the narrowing.
>
> Yes, I meant to discuss this. The problem with respecting narrowing is
> that, a user can freely narrow and widen arbitrarily, and Emacs needs
> to translate them into insertion & deletion of the buffer text for
> tree-sitter, every time a user narrows or widens the buffer. 

I don't think that's the right thing to do. tree-sitter should always
have a tree that represents the entire buffer; if the user narrows,
edits will only affect the narrowed region, but tree-sitter won't notice
that, and won't care.

In particular, that means buffer positions reported by tree-sitter will
match emacs buffer positions.

> Plus, if tree-sitter respects narrowing, it could happen where a user
> narrows the buffer, the font-locking changes and is not correct
> anymore. Maybe that’s not the user want. 

Exactly. The indent will be wrong, too, if narrowing excludes a
containing block.

-- 
-- Stephe



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-22 19:29                                                   ` Óscar Fuentes
  2021-07-23  5:21                                                     ` Eli Zaretskii
@ 2021-07-24  9:38                                                     ` Stephen Leake
  1 sibling, 0 replies; 284+ messages in thread
From: Stephen Leake @ 2021-07-24  9:38 UTC (permalink / raw)
  To: Óscar Fuentes; +Cc: emacs-devel

Óscar Fuentes <ofv@wanadoo.es> writes:

> Eli Zaretskii <eliz@gnu.org> writes:
>> (b) having a feature that fails before you
>> reach the max size of a buffer Emacs supports is a problem, because it
>> will cause hard-to-deal-with problems.
>
> We can put reasonable limits on when to use TS once we have some
> experience with it. What matters right now is if TS would be usable for
> the typical use case, and I guess the answer is positive. Also, it is
> not as if we had other options to consider.

wisi supports > 4G, not that I've actually tried it. And incremental
parse is now working well enough to benchmark, in my devel version.

-- 
-- Stephe



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-22 17:00                                               ` Eli Zaretskii
  2021-07-22 17:47                                                 ` Yuan Fu
  2021-07-23 14:07                                                 ` Stefan Monnier
@ 2021-07-24  9:42                                                 ` Stephen Leake
  2021-07-24 11:22                                                   ` Eli Zaretskii
  2 siblings, 1 reply; 284+ messages in thread
From: Stephen Leake @ 2021-07-24  9:42 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Yuan Fu, emacs-devel, cpitclaudel, monnier

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Yuan Fu <casouri@gmail.com>
>> 
>> Yes, I meant to discuss this. The problem with respecting narrowing
>> is that, a user can freely narrow and widen arbitrarily, and Emacs
>> needs to translate them into insertion & deletion of the buffer text
>> for tree-sitter, every time a user narrows or widens the buffer.
>> Plus, if tree-sitter respects narrowing, it could happen where a
>> user narrows the buffer, the font-locking changes and is not correct
>> anymore. Maybe that’s not the user want. Also, if someone narrows
>> and widens often, maybe narrow to a function for better focus,
>> tree-sitter needs to constantly re-parse most of the buffer. These
>> are not significant disadvantages, but what do we get from
>> respecting narrowing that justifies code complexity and these small
>> annoyances?
>
> But that's how the current font-lock and indentation work: they never
> look beyond the narrowing limits.  

And that's broken, unless the narrowing is for multi-major-mode.

-- 
-- Stephe



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-24  9:42                                                 ` Stephen Leake
@ 2021-07-24 11:22                                                   ` Eli Zaretskii
  2021-07-25 18:21                                                     ` Stephen Leake
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-24 11:22 UTC (permalink / raw)
  To: Stephen Leake; +Cc: casouri, emacs-devel, cpitclaudel, monnier

> From: Stephen Leake <stephen_leake@stephe-leake.org>
> Cc: Yuan Fu <casouri@gmail.com>,  cpitclaudel@gmail.com,
>   monnier@iro.umontreal.ca,  emacs-devel@gnu.org
> Date: Sat, 24 Jul 2021 02:42:24 -0700
> 
> > But that's how the current font-lock and indentation work: they never
> > look beyond the narrowing limits.  
> 
> And that's broken

??? Of course, it isn't: it's how Emacs has worked since v21.1.

> unless the narrowing is for multi-major-mode.

And what would you do in that case, if you allow TS to look beyond the
restriction?



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-17 17:54                                       ` Eli Zaretskii
@ 2021-07-24 14:08                                         ` Stefan Monnier
  2021-07-24 14:32                                           ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Stefan Monnier @ 2021-07-24 14:08 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: cpitclaudel, emacs-devel

>> If we copy the buffer's content to a freshly malloc area before passing
>> that to TS, then there should be no problem running TS in a separate
>> concurrent thread, indeed.
> Making a copy of the buffer is a non-starter from where I stand.  It
> doesn't scale, for starters.  I don't see any reason to go to such a
> complex design at this early stage.

I see absolutely no problem with scaling in making a copy: the extra
memory and CPU time taken by the copy will be a constant factor which
I don't expect to go much beyond 10%, which doesn't threaten scaling and
seems perfectly acceptable in return for being able to perform the
parse concurrently.

I'm not sure we'll want to do that, but I see no reason to consider it
a non-starter.

[ BTW, it's not clear to me if an update needs to be able to read the
  whole buffer or if it only needs access to the "update
  description".  ]


        Stefan




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-24  6:02                                                       ` Eli Zaretskii
@ 2021-07-24 14:19                                                         ` Stefan Monnier
  0 siblings, 0 replies; 284+ messages in thread
From: Stefan Monnier @ 2021-07-24 14:19 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: casouri, cpitclaudel, emacs-devel

>> But font-lock may look (and fontify) beyond the narrowing, and
>> when it calls `syntax-ppss` it will usually parse from 1 rather than
>> from `point-min`.
> Yes, and that's why I said that callers should call 'widen' if they
> need to do so.
> The question is what should TS reading do _by_default_.

Ah, then we're in violent agreement.  The low-level interface with TS
should access the text within the narrowed region.  And the code that
calls TS will usually want to widen beforehand.


        Stefan




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-24 14:08                                         ` Stefan Monnier
@ 2021-07-24 14:32                                           ` Eli Zaretskii
  2021-07-24 15:10                                             ` Stefan Monnier
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-24 14:32 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: cpitclaudel, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: cpitclaudel@gmail.com,  emacs-devel@gnu.org
> Date: Sat, 24 Jul 2021 10:08:58 -0400
> 
> >> If we copy the buffer's content to a freshly malloc area before passing
> >> that to TS, then there should be no problem running TS in a separate
> >> concurrent thread, indeed.
> > Making a copy of the buffer is a non-starter from where I stand.  It
> > doesn't scale, for starters.  I don't see any reason to go to such a
> > complex design at this early stage.
> 
> I see absolutely no problem with scaling in making a copy: the extra
> memory and CPU time taken by the copy will be a constant factor which
> I don't expect to go much beyond 10%

10% of what?  It will be 100% of all the buffers that need parsing.

> I'm not sure we'll want to do that, but I see no reason to consider it
> a non-starter.

It's a bad start, okay?

Anyway, it looks like nothing like that will be necessary,
fortunately.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-23 20:22                                                         ` Yuan Fu
  2021-07-24  6:00                                                           ` Eli Zaretskii
@ 2021-07-24 15:04                                                           ` Yuan Fu
  2021-07-24 15:48                                                             ` Eli Zaretskii
  2021-07-24 16:14                                                             ` Eli Zaretskii
  1 sibling, 2 replies; 284+ messages in thread
From: Yuan Fu @ 2021-07-24 15:04 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Clément Pit-Claudel, Stefan Monnier, emacs-devel

[-- Attachment #1: Type: text/plain, Size: 2070 bytes --]

I wrote a simple interface between font-lock and tree-sitter, and it works pretty well: using tree-sitter for fontification, xdisp.c opens a lot faster, and scrolling through the buffer is also perceivably faster. My simple interface works like this: tree-sitter allow you to “pattern match” nodes in the parse tree with a DSL, and assign names to the matched nodes, e.g., given a pattern, you get back a list of (NAME . MATCHED-NODE). And if we use font-lock faces as names for those nodes, we get back a list of (FACE . MATCHED-NODE) from tree-sitter, and Emacs can simply look at the beginning and end of the node, and apply FACE to that range. For flexibility, FACE can also be a function, in which case the function is called with the node. This interface is basically what emacs-tree-sitter does (I don’t know if they allow a capture name to be a function, though.)

I have an example major-mode for C that uses tree-sitter for font-locking at the end of tree-sitter.el. 

Main functions to look at: tree-sitter-query-capture in tree_sitter.c, and tree-sitter-fontify-region-function in tree-sitter.el.

On the font-lock front, tree-sitter-fontify-region-function replaces font-lock-default-fontify-region, and tree-sitter-font-lock-settings replaces font-lock-defaults and font-lock-keywords. I should support font-lock-maximum-decoration but haven’t came up with a good way to do that. Maybe I should somehow reuse font-lock-defaults, and make it able to configure for tree-sitter font-locking? Apart from font-lock-maximum-decoration, what else should tree-sitter share with font-lock?

BTW, what is the best way to signal a lisp error from C? I tried xsignal2, signal_error, error and friends but they seem to crash Emacs. Maybe I wasn’t using them correctly.

IIUC if we want tree-sitter to use our malloc, we need to build it with Emacs, where should I put the source of tree-sitter?

What’s the different between make_string and make_pure_c_string? I’ve seen this “pure” thing else where, what does “pure” mean?

Yuan


[-- Attachment #2: ts.4.patch --]
[-- Type: application/octet-stream, Size: 36844 bytes --]

From d28e10e5905d244d92b71b74566c0bed80d5ed2b Mon Sep 17 00:00:00 2001
From: Yuan Fu <casouri@gmail.com>
Date: Sat, 24 Jul 2021 10:39:15 -0400
Subject: [PATCH] checkpoint 4

- Add font-locking
- Remove change-recording from replace_range_2, add to casify_region
---
 lisp/emacs-lisp/cl-preloaded.el |   2 +
 lisp/tree-sitter.el             | 276 ++++++++++++++++++++++++++
 src/casefiddle.c                |  12 ++
 src/insdel.c                    |  11 +-
 src/tree_sitter.c               | 332 +++++++++++++++++++++++++++++---
 src/tree_sitter.h               |   4 +-
 test/src/tree-sitter-tests.el   |  58 +++++-
 7 files changed, 655 insertions(+), 40 deletions(-)
 create mode 100644 lisp/tree-sitter.el

diff --git a/lisp/emacs-lisp/cl-preloaded.el b/lisp/emacs-lisp/cl-preloaded.el
index 7365e23186..2dccdff91a 100644
--- a/lisp/emacs-lisp/cl-preloaded.el
+++ b/lisp/emacs-lisp/cl-preloaded.el
@@ -68,6 +68,8 @@ cl--typeof-types
     (font-spec atom) (font-entity atom) (font-object atom)
     (vector array sequence atom)
     (user-ptr atom)
+    (tree-sitter-parser atom)
+    (tree-sitter-node atom)
     ;; Plus, really hand made:
     (null symbol list sequence atom))
   "Alist of supertypes.
diff --git a/lisp/tree-sitter.el b/lisp/tree-sitter.el
new file mode 100644
index 0000000000..a6ecb09386
--- /dev/null
+++ b/lisp/tree-sitter.el
@@ -0,0 +1,276 @@
+;;; tree-sitter.el --- tree-sitter utilities -*- lexical-binding: t -*-
+
+;; Copyright (C) 2021 Free Software Foundation, Inc.
+
+;; This file is part of GNU Emacs.
+
+;; GNU Emacs is free software: you can redistribute it and/or modify
+;; it under the terms of the GNU General Public License as published by
+;; the Free Software Foundation, either version 3 of the License, or
+;; (at your option) any later version.
+
+;; GNU Emacs is distributed in the hope that it will be useful,
+;; but WITHOUT ANY WARRANTY; without even the implied warranty of
+;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+;; GNU General Public License for more details.
+
+;; You should have received a copy of the GNU General Public License
+;; along with GNU Emacs.  If not, see <https://www.gnu.org/licenses/>.
+
+;;; Commentary:
+
+;;; Code:
+
+;;; Node & parser accessors
+
+(defun tree-sitter-node-buffer (node)
+  "Return the buffer in where NODE belongs."
+  (tree-sitter-parser-buffer
+   (tree-sitter-node-parser node)))
+
+;;; Parser API supplement
+
+(defun tree-sitter-get-parser (name)
+  "Find the first parser with name NAME in `tree-sitter-parser-list'.
+Return nil if we can't find any."
+  (catch 'found
+    (dolist (parser tree-sitter-parser-list)
+      (when (equal name (tree-sitter-parser-name parser))
+        (throw 'found parser)))))
+
+(defun tree-sitter-get-parser-create (name language)
+  "Find the first parser with name NAME in `tree-sitter-parser-list'.
+If none exists, create one and return it.  LANGUAGE is passed to
+`tree-sitter-create-parser' when creating the parser."
+  (or (tree-sitter-get-parser name)
+      (tree-sitter-create-parser (current-buffer) language name)))
+
+;;; Node API supplement
+
+(defun tree-sitter-node-beginning (node)
+  "Return the start position of NODE."
+  (byte-to-position (tree-sitter-node-start-byte node)))
+
+(defun tree-sitter-node-end (node)
+  "Return the end position of NODE."
+  (byte-to-position (tree-sitter-node-end-byte node)))
+
+(defun tree-sitter-node-in-range (beg end &optional parser-name named)
+  "Return the smallest node covering BEG to END.
+Find node in current buffer.  Return nil if none find.  If NAMED
+non-nil, only look for named node.  NAMED defaults to nil.  By
+default, use the first parser in `tree-sitter-parser-list'; but
+if PARSER-NAME is non-nil, it specifies the name of the parser that
+should be used."
+  (when-let ((root (tree-sitter-parser-root-node
+                    (if parser-name
+                        (tree-sitter-get-parser parser-name)
+                      (car tree-sitter-parser-list)))))
+    (tree-sitter-node-descendant-for-byte-range
+     root (position-bytes beg) (position-bytes end) named)))
+
+(defun tree-sitter-filter-child (node pred &optional named)
+  "Return children of NODE that satisfies PRED.
+PRED is a function that takes one argument, the child node.  If
+NAMED non-nil, only search named node.  NAMED defaults to nil."
+  (let ((child (tree-sitter-node-child node 0 named))
+        result)
+    (while child
+      (when (funcall pred child)
+        (push child result))
+      (setq child (tree-sitter-node-next-sibling child named)))
+    result))
+
+(defun tree-sitter-node-content (node)
+  "Return the buffer content corresponding to NODE."
+  (with-current-buffer (tree-sitter-node-buffer node)
+    (buffer-substring-no-properties
+     (tree-sitter-node-beginning node)
+     (tree-sitter-node-end node))))
+
+;;; Font-lock
+
+(defvar-local tree-sitter-font-lock-settings nil
+  "A list of settings for tree-sitter-based font-locking.
+
+Each setting controls one parser (often of different language).
+A settings is a list of form (NAME LANGUAGE PATTERN).  NAME is
+the name given to the parser, by convention it is
+\"font-lock-<language>\", where <language> is the language that
+the parser uses.  LANGUAGE is the language object returned by
+tree-sitter language dynamic modules.
+
+PATTERN is a tree-sitter query pattern. (See manual for how to
+write query patterns.)  This pattern should capture nodes with
+either face names or function names.  If captured with a face
+name, the node's corresponding text in the buffer is fontified
+with that face; if captured with a function name, the function is
+called with three arguments, BEG END NODE, where BEG and END
+marks the span of the corresponding text, and NODE is the node
+itself.")
+
+(defun tree-sitter-fontify-region-function (beg end &optional verbose)
+  "Fontify the region between BEG and END.
+If VERBOSE is non-nil, print status messages.
+\(See `font-lock-fontify-region-function'.)"
+  (dolist (elm tree-sitter-font-lock-settings)
+    (let ((parser-name (car elm))
+          (language (nth 1 elm))
+          (match-pattern (nth 2 elm)))
+      (tree-sitter-get-parser-create parser-name language)
+      (when-let ((node (tree-sitter-node-in-range beg end parser-name)))
+        (let ((captures (tree-sitter-query-capture
+                         node match-pattern
+                         ;; specifying the range is important. More
+                         ;; often than not, NODE will be the root
+                         ;; node, and if we don't specify the range,
+                         ;; we are basically querying the whole file.
+                         (position-bytes beg) (position-bytes end))))
+          (with-silent-modifications
+            (while captures
+              (let* ((face (caar captures))
+                     (node (cdar captures))
+                     (beg (tree-sitter-node-beginning node))
+                     (end (tree-sitter-node-end node)))
+                (cond ((facep face)
+                       (put-text-property beg end 'face face))
+                      ((functionp face)
+                       (funcall face beg end node)))
+
+                (if verbose
+                    (message "Fontifying text from %d to %d with %s"
+                             beg end face)))
+              (setq captures (cdr captures))))
+          `(jit-lock-bounds ,(tree-sitter-node-beginning node)
+                            . ,(tree-sitter-node-end node)))))))
+
+
+(define-derived-mode json-mode js-mode "JSON"
+  "Major mode for JSON documents."
+  (setq-local font-lock-fontify-region-function
+              #'tree-sitter-fontify-region-function)
+  (setq-local tree-sitter-font-lock-settings
+              `(("font-lock-json"
+                 ,(tree-sitter-json)
+                 "(string) @font-lock-string-face
+(true) @font-lock-constant-face
+(false) @font-lock-constant-face
+(null) @font-lock-constant-face"))))
+
+(defun ts-c-fontify-system-lib (beg end _)
+  (put-text-property beg (1+ beg) 'face 'font-lock-preprocessor-face)
+  (put-text-property (1- end) end 'face 'font-lock-preprocessor-face)
+  (put-text-property (1+ beg) (1- end)
+                     'face 'font-lock-string-face))
+
+(define-derived-mode ts-c-mode prog-mode "TS C"
+  "C mode with tree-sitter support."
+  (setq-local font-lock-fontify-region-function
+              #'tree-sitter-fontify-region-function)
+  (setq-local tree-sitter-font-lock-settings
+              `(("font-lock-c"
+                 ,(tree-sitter-c)
+                 "(null) @font-lock-constant-face
+(true) @font-lock-constant-face
+(false) @font-lock-constant-face
+
+(comment) @font-lock-comment-face
+
+(system_lib_string) @ts-c-fontify-system-lib
+
+(unary_expression
+  operator: _ @font-lock-negation-char-face)
+
+(string_literal) @font-lock-string-face
+(char_literal) @font-lock-string-face
+
+
+
+(function_definition
+  declarator: (identifier) @font-lock-function-name-face)
+
+(declaration
+  declarator: (identifier) @font-lock-function-name-face)
+
+(function_declarator
+ declarator: (identifier) @font-lock-function-name-face)
+
+
+
+(init_declarator
+ declarator: (identifier) @font-lock-variable-name-face)
+
+(parameter_declaration
+ declarator: (identifier) @font-lock-variable-name-face)
+
+(preproc_def
+ name: (identifier) @font-lock-variable-name-face)
+
+(enumerator
+ name: (identifier) @font-lock-variable-name-face)
+
+(field_identifier) @font-lock-variable-name-face
+
+(parameter_list
+ (parameter_declaration
+  (identifier) @font-lock-variable-name-face))
+
+(pointer_declarator
+ declarator: (identifier) @font-lock-variable-name-face)
+
+(array_declarator
+ declarator: (identifier) @font-lock-variable-name-face)
+
+(preproc_function_def
+ name: (identifier) @font-lock-variable-name-face
+ parameters: (preproc_params
+              (identifier) @font-lock-variable-name-face))
+
+
+
+(type_identifier) @font-lock-type-face
+(primitive_type) @font-lock-type-face
+
+\"auto\" @font-lock-keyword-face
+\"break\" @font-lock-keyword-face
+\"case\" @font-lock-keyword-face
+\"const\" @font-lock-keyword-face
+\"continue\" @font-lock-keyword-face
+\"default\" @font-lock-keyword-face
+\"do\" @font-lock-keyword-face
+\"else\" @font-lock-keyword-face
+\"enum\" @font-lock-keyword-face
+\"extern\" @font-lock-keyword-face
+\"for\" @font-lock-keyword-face
+\"goto\" @font-lock-keyword-face
+\"if\" @font-lock-keyword-face
+\"register\" @font-lock-keyword-face
+\"return\" @font-lock-keyword-face
+\"sizeof\" @font-lock-keyword-face
+\"static\" @font-lock-keyword-face
+\"struct\" @font-lock-keyword-face
+\"switch\" @font-lock-keyword-face
+\"typedef\" @font-lock-keyword-face
+\"union\" @font-lock-keyword-face
+\"volatile\" @font-lock-keyword-face
+\"while\" @font-lock-keyword-face
+
+\"long\" @font-lock-type-face
+\"short\" @font-lock-type-face
+\"signed\" @font-lock-type-face
+\"unsigned\" @font-lock-type-face
+
+\"#include\" @font-lock-preprocessor-face
+\"#define\" @font-lock-preprocessor-face
+\"#ifdef\" @font-lock-preprocessor-face
+\"#ifndef\" @font-lock-preprocessor-face
+\"#endif\" @font-lock-preprocessor-face
+\"#else\" @font-lock-preprocessor-face
+\"#elif\" @font-lock-preprocessor-face"))))
+
+(add-to-list 'auto-mode-alist '("\\.json\\'" . json-mode))
+(add-to-list 'auto-mode-alist '("\\.tsc\\'" . ts-c-mode))
+
+(provide 'tree-sitter)
+
+;;; tree-sitter.el ends here
diff --git a/src/casefiddle.c b/src/casefiddle.c
index a7a2541490..42cd2fdd28 100644
--- a/src/casefiddle.c
+++ b/src/casefiddle.c
@@ -30,6 +30,10 @@ Copyright (C) 1985, 1994, 1997-1999, 2001-2021 Free Software Foundation,
 #include "composite.h"
 #include "keymap.h"
 
+#ifdef HAVE_TREE_SITTER
+#include "tree_sitter.h"
+#endif
+
 enum case_action {CASE_UP, CASE_DOWN, CASE_CAPITALIZE, CASE_CAPITALIZE_UP};
 
 /* State for casing individual characters.  */
@@ -495,6 +499,11 @@ casify_region (enum case_action flag, Lisp_Object b, Lisp_Object e)
   modify_text (start, end);
   prepare_casing_context (&ctx, flag, true);
 
+#ifdef HAVE_TREE_SITTER
+  ptrdiff_t start_byte = CHAR_TO_BYTE (start);
+  ptrdiff_t old_end_byte = CHAR_TO_BYTE (end);
+#endif
+
   ptrdiff_t orig_end = end;
   record_delete (start, make_buffer_string (start, end, true), false);
   if (NILP (BVAR (current_buffer, enable_multibyte_characters)))
@@ -513,6 +522,9 @@ casify_region (enum case_action flag, Lisp_Object b, Lisp_Object e)
     {
       signal_after_change (start, end - start - added, end - start);
       update_compositions (start, end, CHECK_ALL);
+#ifdef HAVE_TREE_SITTER
+      ts_record_change (start_byte, old_end_byte, CHAR_TO_BYTE (end));
+#endif
     }
 
   return orig_end + added;
diff --git a/src/insdel.c b/src/insdel.c
index b313c50cda..3dfc281b49 100644
--- a/src/insdel.c
+++ b/src/insdel.c
@@ -1592,7 +1592,11 @@ replace_range (ptrdiff_t from, ptrdiff_t to, Lisp_Object new,
    If MARKERS, relocate markers.
 
    Unlike most functions at this level, never call
-   prepare_to_modify_buffer and never call signal_after_change.  */
+   prepare_to_modify_buffer and never call signal_after_change.
+   Because this function is called in a loop, one character at a time.
+   The caller of 'replace_range_2' calls these hooks for the entire
+   region once.  Apart from signal_after_change, any caller of this
+   function should also call ts_record_change.  */
 
 void
 replace_range_2 (ptrdiff_t from, ptrdiff_t from_byte,
@@ -1705,11 +1709,6 @@ replace_range_2 (ptrdiff_t from, ptrdiff_t from_byte,
 
   modiff_incr (&MODIFF);
   CHARS_MODIFF = MODIFF;
-
-#ifdef HAVE_TREE_SITTER
-  ts_record_change (from_byte, to_byte, from_byte + insbytes);
-#endif
-
 }
 \f
 /* Delete characters in current buffer
diff --git a/src/tree_sitter.c b/src/tree_sitter.c
index a6a8912c84..e9f8ddc7e3 100644
--- a/src/tree_sitter.c
+++ b/src/tree_sitter.c
@@ -35,6 +35,8 @@ Copyright (C) 2021 Free Software Foundation, Inc.
 /* parser.h defines a macro ADVANCE that conflicts with alloc.c.  */
 #include <tree_sitter/parser.h>
 
+/*** Functions related to parser and node object.  */
+
 DEFUN ("tree-sitter-parser-p",
        Ftree_sitter_parser_p, Stree_sitter_parser_p, 1, 1, 0,
        doc: /* Return t if OBJECT is a tree-sitter parser.  */)
@@ -57,6 +59,8 @@ DEFUN ("tree-sitter-node-p",
     return Qnil;
 }
 
+/*** Parsing functions */
+
 /* Update each parser's tree after the user made an edit.  This
 function does not parse the buffer and only updates the tree. (So it
 should be very fast.)  */
@@ -77,7 +81,6 @@ ts_record_change (ptrdiff_t start_byte, ptrdiff_t old_end_byte,
       XTS_PARSER (lisp_parser)->need_reparse = true;
       parser_list = Fcdr (parser_list);
     }
-
 }
 
 /* Parse the buffer.  We don't parse until we have to. When we have
@@ -91,9 +94,19 @@ ts_ensure_parsed (Lisp_Object parser)
   TSTree *tree = XTS_PARSER(parser)->tree;
   TSInput input = XTS_PARSER (parser)->input;
   TSTree *new_tree = ts_parser_parse(ts_parser, tree, input);
+  /* This should be very rare: it only happens when 1) language is not
+     set (impossible in Emacs because the user has to supply a
+     language to create a parser), 2) parse canceled due to timeout
+     (impossible because we don't set a timeout), 3) parse canceled
+     due to cancellation flag (impossible because we don't set the
+     flag).  (See comments for ts_parser_parse in
+     tree_sitter/api.h.)  */
+  if (new_tree == NULL)
+    signal_error ("Parse failed", parser);
   ts_tree_delete (tree);
   XTS_PARSER (parser)->tree = new_tree;
   XTS_PARSER (parser)->need_reparse = false;
+  TSNode node = ts_tree_root_node (new_tree);
 }
 
 /* This is the read function provided to tree-sitter to read from a
@@ -103,9 +116,6 @@ ts_ensure_parsed (Lisp_Object parser)
 ts_read_buffer (void *buffer, uint32_t byte_index,
 		TSPoint position, uint32_t *bytes_read)
 {
-  if (!BUFFER_LIVE_P ((struct buffer *) buffer))
-    error ("BUFFER is not live");
-
   ptrdiff_t byte_pos = byte_index + 1;
 
   /* Read one character.  Tree-sitter wants us to set bytes_read to 0
@@ -114,8 +124,17 @@ ts_read_buffer (void *buffer, uint32_t byte_index,
      string.  */
   char *beg;
   int len;
+  /* This function could run from a user command, so it is better to
+     do nothing instead of raising an error. (It was a pain in the a**
+     to read mega-if-conditions in Emacs source, so I write the two
+     branches separately, hoping the compiler can merge them.)  */
+  if (!BUFFER_LIVE_P ((struct buffer *) buffer))
+    {
+      beg = "";
+      len = 0;
+    }
   // TODO BUF_ZV_BYTE?
-  if (byte_pos >= BUF_Z_BYTE ((struct buffer *) buffer))
+  else if (byte_pos >= BUF_Z_BYTE ((struct buffer *) buffer))
     {
       beg = "";
       len = 0;
@@ -123,19 +142,23 @@ ts_read_buffer (void *buffer, uint32_t byte_index,
   else
     {
       beg = (char *) BUF_BYTE_ADDRESS (buffer, byte_pos);
-      len = next_char_len(byte_pos);
+      len = BYTES_BY_CHAR_HEAD ((int) beg);
     }
   *bytes_read = (uint32_t) len;
 
   return beg;
 }
 
+/*** Creators and accessors for parser and node */
+
 /* Wrap the parser in a Lisp_Object to be used in the Lisp machine.  */
 Lisp_Object
-make_ts_parser (struct buffer *buffer, TSParser *parser, TSTree *tree)
+make_ts_parser (struct buffer *buffer, TSParser *parser,
+		TSTree *tree, Lisp_Object name)
 {
   struct Lisp_TS_Parser *lisp_parser
-    = ALLOCATE_PLAIN_PSEUDOVECTOR (struct Lisp_TS_Parser, PVEC_TS_PARSER);
+    = ALLOCATE_PSEUDOVECTOR (struct Lisp_TS_Parser, name, PVEC_TS_PARSER);
+  lisp_parser->name = name;
   lisp_parser->buffer = buffer;
   lisp_parser->parser = parser;
   lisp_parser->tree = tree;
@@ -156,17 +179,35 @@ make_ts_node (Lisp_Object parser, TSNode node)
   return make_lisp_ptr (lisp_node, Lisp_Vectorlike);
 }
 
+DEFUN ("tree-sitter-node-parser",
+       Ftree_sitter_node_parser, Stree_sitter_node_parser,
+       1, 1, 0,
+       doc: /* Return the parser to which NODE belongs.  */)
+  (Lisp_Object node)
+{
+  CHECK_TS_NODE (node);
+  return XTS_NODE (node)->parser;
+}
 
 DEFUN ("tree-sitter-create-parser",
        Ftree_sitter_create_parser, Stree_sitter_create_parser,
-       2, 2, 0,
+       2, 3, 0,
        doc: /* Create and return a parser in BUFFER for LANGUAGE.
+
 The parser is automatically added to BUFFER's
 `tree-sitter-parser-list'.  LANGUAGE should be the language provided
-by a tree-sitter language dynamic module.  */)
-  (Lisp_Object buffer, Lisp_Object language)
+by a tree-sitter language dynamic module.
+
+NAME (a string) is the name assigned to the parser, like the name for
+a process.  Unlike process names, not care is taken to make each
+parser's name unique.  By default, no name is assigned to the parser;
+the only consequence of that is you can't use
+`tree-sitter-get-parser' to find the parser by its name.  */)
+  (Lisp_Object buffer, Lisp_Object language, Lisp_Object name)
 {
   CHECK_BUFFER(buffer);
+  if (!NILP (name))
+    CHECK_STRING (name);
 
   /* LANGUAGE is a USER_PTR that contains the pointer to a TSLanguage
      struct.  */
@@ -175,9 +216,8 @@ DEFUN ("tree-sitter-create-parser",
   ts_parser_set_language (parser, lang);
 
   Lisp_Object lisp_parser
-    = make_ts_parser (XBUFFER(buffer), parser, NULL);
+    = make_ts_parser (XBUFFER(buffer), parser, NULL, name);
 
-  // FIXME: Is this the correct way to set a buffer-local variable?
   struct buffer *old_buffer = current_buffer;
   set_buffer_internal (XBUFFER (buffer));
 
@@ -188,6 +228,30 @@ DEFUN ("tree-sitter-create-parser",
   return lisp_parser;
 }
 
+DEFUN ("tree-sitter-parser-buffer",
+       Ftree_sitter_parser_buffer, Stree_sitter_parser_buffer,
+       1, 1, 0,
+       doc: /* Return the buffer of PARSER.  */)
+  (Lisp_Object parser)
+{
+  CHECK_TS_PARSER (parser);
+  Lisp_Object buf;
+  XSETBUFFER (buf, XTS_PARSER (parser)->buffer);
+  return buf;
+}
+
+DEFUN ("tree-sitter-parser-name",
+       Ftree_sitter_parser_name, Stree_sitter_parser_name,
+       1, 1, 0,
+       doc: /* Return parser's name.  */)
+  (Lisp_Object parser)
+{
+  CHECK_TS_PARSER (parser);
+  return XTS_PARSER (parser)->name;
+}
+
+/*** Parser API */
+
 DEFUN ("tree-sitter-parser-root-node",
        Ftree_sitter_parser_root_node, Stree_sitter_parser_root_node,
        1, 1, 0,
@@ -200,7 +264,8 @@ DEFUN ("tree-sitter-parser-root-node",
   return make_ts_node (parser, root_node);
 }
 
-DEFUN ("tree-sitter-parse", Ftree_sitter_parse, Stree_sitter_parse,
+DEFUN ("tree-sitter-parse-string",
+       Ftree_sitter_parse_string, Stree_sitter_parse_string,
        2, 2, 0,
        doc: /* Parse STRING and return the root node.
 LANGUAGE should be the language provided by a tree-sitter language
@@ -219,23 +284,20 @@ DEFUN ("tree-sitter-parse", Ftree_sitter_parse, Stree_sitter_parse,
 					 SSDATA (string),
 					 strlen (SSDATA (string)));
 
-  /* See comment for ts_parser_parse in tree_sitter/api.h
-     for possible reasons for a failure.  */
+  /* See comment in ts_ensure_parsed for possible reasons for a
+     failure.  */
   if (tree == NULL)
     signal_error ("Failed to parse STRING", string);
 
   TSNode root_node = ts_tree_root_node (tree);
 
-  Lisp_Object lisp_parser = make_ts_parser (NULL, parser, tree);
+  Lisp_Object lisp_parser = make_ts_parser (NULL, parser, tree, Qnil);
   Lisp_Object lisp_node = make_ts_node (lisp_parser, root_node);
 
   return lisp_node;
 }
 
-/* Below this point are uninteresting mechanical translations of
-   tree-sitter API.  */
-
-/* Node functions.  */
+/*** Node API  */
 
 DEFUN ("tree-sitter-node-type",
        Ftree_sitter_node_type, Stree_sitter_node_type, 1, 1, 0,
@@ -245,9 +307,31 @@ DEFUN ("tree-sitter-node-type",
   CHECK_TS_NODE (node);
   TSNode ts_node = XTS_NODE (node)->node;
   const char *type = ts_node_type(ts_node);
+  // TODO: Maybe return a string instead.
   return intern_c_string (type);
 }
 
+DEFUN ("tree-sitter-node-start-byte",
+       Ftree_sitter_node_start_byte, Stree_sitter_node_start_byte, 1, 1, 0,
+       doc: /* Return the NODE's start byte position.  */)
+  (Lisp_Object node)
+{
+  CHECK_TS_NODE (node);
+  TSNode ts_node = XTS_NODE (node)->node;
+  uint32_t start_byte = ts_node_start_byte(ts_node);
+  return make_fixnum(start_byte + 1);
+}
+
+DEFUN ("tree-sitter-node-end-byte",
+       Ftree_sitter_node_end_byte, Stree_sitter_node_end_byte, 1, 1, 0,
+       doc: /* Return the NODE's end byte position.  */)
+  (Lisp_Object node)
+{
+  CHECK_TS_NODE (node);
+  TSNode ts_node = XTS_NODE (node)->node;
+  uint32_t end_byte = ts_node_end_byte(ts_node);
+  return make_fixnum(end_byte + 1);
+}
 
 DEFUN ("tree-sitter-node-string",
        Ftree_sitter_node_string, Stree_sitter_node_string, 1, 1, 0,
@@ -303,9 +387,9 @@ DEFUN ("tree-sitter-node-check",
        Ftree_sitter_node_check, Stree_sitter_node_check, 2, 2, 0,
        doc: /* Return non-nil if NODE is in condition COND, nil otherwise.
 
-COND could be 'named, 'missing, 'extra, 'has-error.  Named nodes
-correspond to named rules in the grammar, whereas "anonymous" nodes
-correspond to string literals in the grammar.
+COND could be 'named, 'missing, 'extra, 'has-changes, 'has-error.
+Named nodes correspond to named rules in the grammar, whereas
+"anonymous" nodes correspond to string literals in the grammar.
 
 Missing nodes are inserted by the parser in order to recover from
 certain kinds of syntax errors, i.e., should be there but not there.
@@ -313,6 +397,9 @@ DEFUN ("tree-sitter-node-check",
 Extra nodes represent things like comments, which are not required the
 grammar, but can appear anywhere.
 
+A node "has changes" if the buffer changed since the node is
+created. (Don't forget the "s" at the end of 'has-changes.)
+
 A node "has error" if itself is a syntax error or contains any syntax
 errors.  */)
   (Lisp_Object node, Lisp_Object cond)
@@ -329,7 +416,10 @@ DEFUN ("tree-sitter-node-check",
     result = ts_node_is_extra (ts_node);
   else if (EQ (cond, Qhas_error))
     result = ts_node_has_error (ts_node);
+  else if (EQ (cond, Qhas_changes))
+    result = ts_node_has_changes (ts_node);
   else
+    // TODO: Is this a good error message?
     signal_error ("Expecting one of four symbols, see docstring", cond);
   return result ? Qt : Qnil;
 }
@@ -432,8 +522,177 @@ DEFUN ("tree-sitter-node-prev-sibling",
   return make_ts_node(XTS_NODE (node)->parser, sibling);
 }
 
+DEFUN ("tree-sitter-node-first-child-for-byte",
+       Ftree_sitter_node_first_child_for_byte,
+       Stree_sitter_node_first_child_for_byte, 2, 3, 0,
+       doc: /* Return the first child of NODE on POS.
+Specifically, return the first child that extends beyond POS.  POS is
+a byte position in the buffer counting from 1.  Return nil if there
+isn't any.  If NAMED is non-nil, look for named child only.  NAMED
+defaults to nil.  Note that this function returns an immediate child,
+not the smallest (grand)child.  */)
+  (Lisp_Object node, Lisp_Object pos, Lisp_Object named)
+{
+  CHECK_INTEGER (pos);
+
+  struct buffer *buf = (XTS_PARSER (XTS_NODE (node)->parser)->buffer);
+  ptrdiff_t byte_pos = XFIXNUM (pos);
+
+  if (byte_pos < BUF_BEGV_BYTE (buf) || byte_pos > BUF_ZV_BYTE (buf))
+    xsignal1 (Qargs_out_of_range, pos);
+
+  TSNode ts_node = XTS_NODE (node)->node;
+  TSNode child;
+  if (NILP (named))
+    child = ts_node_first_child_for_byte (ts_node, byte_pos - 1);
+  else
+    child = ts_node_first_named_child_for_byte (ts_node, byte_pos - 1);
+
+  if (ts_node_is_null(child))
+    return Qnil;
+
+  return make_ts_node(XTS_NODE (node)->parser, child);
+}
+
+DEFUN ("tree-sitter-node-descendant-for-byte-range",
+       Ftree_sitter_node_descendant_for_byte_range,
+       Stree_sitter_node_descendant_for_byte_range, 3, 4, 0,
+       doc: /* Return the smallest node that covers BEG to END.
+The returned node is a descendant of NODE.  POS is a byte position
+counting from 1.  Return nil if there isn't any.  If NAMED is non-nil,
+look for named child only.  NAMED defaults to nil.  */)
+  (Lisp_Object node, Lisp_Object beg, Lisp_Object end, Lisp_Object named)
+{
+  CHECK_INTEGER (beg);
+  CHECK_INTEGER (end);
+
+  struct buffer *buf = (XTS_PARSER (XTS_NODE (node)->parser)->buffer);
+  ptrdiff_t byte_beg = XFIXNUM (beg);
+  ptrdiff_t byte_end = XFIXNUM (end);
+
+  /* Checks for BUFFER_BEG <= BEG <= END <= BUFFER_END.  */
+  if (!(BUF_BEGV_BYTE (buf) <= byte_beg
+	&& byte_beg <= byte_end
+	&& byte_end <= BUF_ZV_BYTE (buf)))
+    xsignal2 (Qargs_out_of_range, beg, end);
+
+  TSNode ts_node = XTS_NODE (node)->node;
+  TSNode child;
+  if (NILP (named))
+    child = ts_node_descendant_for_byte_range
+      (ts_node, byte_beg - 1 , byte_end - 1);
+  else
+    child = ts_node_named_descendant_for_byte_range
+      (ts_node, byte_beg - 1, byte_end - 1);
+
+  if (ts_node_is_null(child))
+    return Qnil;
+
+  return make_ts_node(XTS_NODE (node)->parser, child);
+}
+
 /* Query functions */
 
+Lisp_Object ts_query_error_to_string (TSQueryError error)
+{
+  char *error_name;
+  switch (error)
+    {
+    case TSQueryErrorNone:
+      error_name = "none";
+      break;
+    case TSQueryErrorSyntax:
+      error_name = "syntax";
+      break;
+    case TSQueryErrorNodeType:
+      error_name = "node type";
+      break;
+    case TSQueryErrorField:
+      error_name = "field";
+      break;
+    case TSQueryErrorCapture:
+      error_name = "capture";
+      break;
+    case TSQueryErrorStructure:
+      error_name = "structure";
+      break;
+    }
+  return  make_pure_c_string (error_name, strlen(error_name));
+}
+
+DEFUN ("tree-sitter-query-capture",
+       Ftree_sitter_query_capture,
+       Stree_sitter_query_capture, 2, 4, 0,
+       doc: /* Query NODE with PATTERN.
+
+Returns a list of (CAPTURE_NAME . NODE).  CAPTURE_NAME is the name
+assigned to the node in PATTERN.  NODE is the captured node.
+
+PATTERN is a string containing one or more matching patterns.  See
+manual for further explanation for how to write a match pattern.
+
+BEG and END, if _both_ non-nil, specifies the range in which the query
+is executed.
+
+Return nil if the query failed.  */)
+  (Lisp_Object node, Lisp_Object pattern,
+   Lisp_Object beg, Lisp_Object end)
+{
+  CHECK_TS_NODE (node);
+  CHECK_STRING (pattern);
+
+  TSNode ts_node = XTS_NODE (node)->node;
+  Lisp_Object lisp_parser = XTS_NODE (node)->parser;
+  const TSLanguage *lang = ts_parser_language
+    (XTS_PARSER (lisp_parser)->parser);
+  char *source = SSDATA (pattern);
+
+  uint32_t error_offset;
+  uint32_t error_type;
+  TSQuery *query = ts_query_new (lang, source, strlen (source),
+				 &error_offset, &error_type);
+  TSQueryCursor *cursor = ts_query_cursor_new ();
+
+  if (query == NULL)
+    {
+      // FIXME: Signal an error?
+      return Qnil;
+    }
+  if (!NILP (beg) && !NILP (end))
+    {
+      EMACS_INT beg_byte = XFIXNUM (beg);
+      EMACS_INT end_byte = XFIXNUM (end);
+      ts_query_cursor_set_byte_range
+	(cursor, (uint32_t) beg_byte - 1, (uint32_t) end_byte - 1);
+    }
+
+  ts_query_cursor_exec (cursor, query, ts_node);
+  TSQueryMatch match;
+  TSQueryCapture capture;
+  Lisp_Object result = Qnil;
+  Lisp_Object entry;
+  Lisp_Object captured_node;
+  const char *capture_name;
+  uint32_t capture_name_len;
+  while (ts_query_cursor_next_match (cursor, &match))
+    {
+      const TSQueryCapture *captures = match.captures;
+      for (int idx=0; idx < match.capture_count; idx++)
+	{
+	  capture = captures[idx];
+	  captured_node = make_ts_node(lisp_parser, capture.node);
+	  capture_name = ts_query_capture_name_for_id
+	    (query, capture.index, &capture_name_len);
+	  entry = Fcons (intern_c_string (capture_name),
+			 captured_node);
+	  result = Fcons (entry, result);
+	}
+    }
+  ts_query_delete (query);
+  ts_query_cursor_delete (cursor);
+  return Freverse (result);
+}
+
 /* Initialize the tree-sitter routines.  */
 void
 syms_of_tree_sitter (void)
@@ -443,11 +702,18 @@ syms_of_tree_sitter (void)
   DEFSYM (Qnamed, "named");
   DEFSYM (Qmissing, "missing");
   DEFSYM (Qextra, "extra");
+  DEFSYM (Qhas_changes, "has-changes");
   DEFSYM (Qhas_error, "has-error");
 
+  DEFSYM (Qtree_sitter_query_error, "tree-sitter-query-error");
+  Fput (Qtree_sitter_query_error, Qerror_conditions,
+	pure_list (Qtree_sitter_query_error, Qerror));
+  Fput (Qtree_sitter_query_error, Qerror_message,
+	build_pure_c_string ("Error with query pattern"))
+
   DEFSYM (Qtree_sitter_parser_list, "tree-sitter-parser-list");
-  DEFVAR_LISP ("ts-parser-list", Vtree_sitter_parser_list,
-		     doc: /* A list of tree-sitter parsers.
+  DEFVAR_LISP ("tree-sitter-parser-list", Vtree_sitter_parser_list,
+	       doc: /* A list of tree-sitter parsers.
 // TODO: more doc.
 If you removed a parser from this list, do not put it back in.  */);
   Vtree_sitter_parser_list = Qnil;
@@ -455,11 +721,19 @@ syms_of_tree_sitter (void)
 
   defsubr (&Stree_sitter_parser_p);
   defsubr (&Stree_sitter_node_p);
+
+  defsubr (&Stree_sitter_node_parser);
+
   defsubr (&Stree_sitter_create_parser);
+  defsubr (&Stree_sitter_parser_buffer);
+  defsubr (&Stree_sitter_parser_name);
+
   defsubr (&Stree_sitter_parser_root_node);
-  defsubr (&Stree_sitter_parse);
+  defsubr (&Stree_sitter_parse_string);
 
   defsubr (&Stree_sitter_node_type);
+  defsubr (&Stree_sitter_node_start_byte);
+  defsubr (&Stree_sitter_node_end_byte);
   defsubr (&Stree_sitter_node_string);
   defsubr (&Stree_sitter_node_parent);
   defsubr (&Stree_sitter_node_child);
@@ -469,4 +743,8 @@ syms_of_tree_sitter (void)
   defsubr (&Stree_sitter_node_child_by_field_name);
   defsubr (&Stree_sitter_node_next_sibling);
   defsubr (&Stree_sitter_node_prev_sibling);
+  defsubr (&Stree_sitter_node_first_child_for_byte);
+  defsubr (&Stree_sitter_node_descendant_for_byte_range);
+
+  defsubr (&Stree_sitter_query_capture);
 }
diff --git a/src/tree_sitter.h b/src/tree_sitter.h
index a7e2a2d670..e9b4a71326 100644
--- a/src/tree_sitter.h
+++ b/src/tree_sitter.h
@@ -33,6 +33,7 @@ #define EMACS_TREE_SITTER_H
 struct Lisp_TS_Parser
 {
   union vectorlike_header header;
+  Lisp_Object name;
   struct buffer *buffer;
   TSParser *parser;
   TSTree *tree;
@@ -95,7 +96,8 @@ ts_record_change (ptrdiff_t start_byte, ptrdiff_t old_end_byte,
 		  ptrdiff_t new_end_byte);
 
 Lisp_Object
-make_ts_parser (struct buffer *buffer, TSParser *parser, TSTree *tree);
+make_ts_parser (struct buffer *buffer, TSParser *parser,
+		TSTree *tree, Lisp_Object name);
 
 Lisp_Object
 make_ts_node (Lisp_Object parser, TSNode node);
diff --git a/test/src/tree-sitter-tests.el b/test/src/tree-sitter-tests.el
index cb1c464d3a..c61ad678d2 100644
--- a/test/src/tree-sitter-tests.el
+++ b/test/src/tree-sitter-tests.el
@@ -21,6 +21,7 @@
 
 (require 'ert)
 (require 'tree-sitter-json)
+(require 'tree-sitter)
 
 (ert-deftest tree-sitter-basic-parsing ()
   "Test basic parsing routines."
@@ -52,12 +53,13 @@ tree-sitter-basic-parsing
 (ert-deftest tree-sitter-node-api ()
   "Tests for node API."
   (with-temp-buffer
-    (insert "[1,2,{\"name\": \"Bob\"},3]")
     (let (parser root-node doc-node object-node pair-node)
-      (setq parser (tree-sitter-create-parser
-                    (current-buffer) (tree-sitter-json)))
-      (setq root-node (tree-sitter-parser-root-node
-                       parser))
+      (progn
+        (insert "[1,2,{\"name\": \"Bob\"},3]")
+        (setq parser (tree-sitter-create-parser
+                      (current-buffer) (tree-sitter-json)))
+        (setq root-node (tree-sitter-parser-root-node
+                         parser)))
       ;; `tree-sitter-node-type'.
       (should (eq 'document (tree-sitter-node-type root-node)))
       ;; `tree-sitter-node-check'.
@@ -100,7 +102,51 @@ tree-sitter-node-api
       (should (equal "(\",\")"
                      (tree-sitter-node-string
                       (tree-sitter-node-prev-sibling object-node))))
-      )))
+      ;; `tree-sitter-node-first-child-for-byte'.
+      (should (equal "(number)"
+                     (tree-sitter-node-string
+                      (tree-sitter-node-first-child-for-byte
+                       doc-node 3 t))))
+      (should (equal "(\",\")"
+                     (tree-sitter-node-string
+                      (tree-sitter-node-first-child-for-byte
+                       doc-node 3))))
+      ;; `tree-sitter-node-descendant-for-byte-range'.
+      (should (equal "(\"{\")"
+                     (tree-sitter-node-string
+                      (tree-sitter-node-descendant-for-byte-range
+                       root-node 6 7))))
+      (should (equal "(object (pair key: (string (string_content)) value: (string (string_content))))"
+                     (tree-sitter-node-string
+                      (tree-sitter-node-descendant-for-byte-range
+                       root-node 6 7 t)))))))
+
+(ert-deftest tree-sitter-query-api ()
+  "Tests for query API."
+  (with-temp-buffer
+    (let (parser root-node pattern doc-node object-node pair-node)
+      (progn
+        (insert "[1,2,{\"name\": \"Bob\"},3]")
+        (setq parser (tree-sitter-create-parser
+                      (current-buffer) (tree-sitter-json)))
+        (setq root-node (tree-sitter-parser-root-node
+                         parser))
+        (setq pattern "(string) @string
+(pair key: (_) @keyword)
+(number) @number"))
+
+      (should
+       (equal
+        '((number . "1") (number . "2")
+          (keyword . "\"name\"")
+          (string . "\"name\"")
+          (string . "\"Bob\"")
+          (number . "3"))
+        (mapcar (lambda (entry)
+                  (cons (car entry)
+                        (tree-sitter-node-content
+                         (cdr entry))))
+                (tree-sitter-query-capture root-node pattern)))))))
 
 (provide 'tree-sitter-tests)
 ;;; tree-sitter-tests.el ends here
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-24 14:32                                           ` Eli Zaretskii
@ 2021-07-24 15:10                                             ` Stefan Monnier
  2021-07-24 15:51                                               ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Stefan Monnier @ 2021-07-24 15:10 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: cpitclaudel, emacs-devel

>> I see absolutely no problem with scaling in making a copy: the extra
>> memory and CPU time taken by the copy will be a constant factor which
>> I don't expect to go much beyond 10%
> 10% of what?  It will be 100% of all the buffers that need parsing.

10% of the memory used by that buffer, since TS's data structure eats up
about 10x the size of the buffer's text.

Given the memory needs of TS we may decide to have
a `tree-sitter-maximum-size` config to disable TS on overly large
buffers (just like font-lock has such a setting, since when used
without jit-lock, font-lock also can easily end up using more memory
than the buffer's text).


        Stefan




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-24 15:04                                                           ` Yuan Fu
@ 2021-07-24 15:48                                                             ` Eli Zaretskii
  2021-07-24 17:14                                                               ` Yuan Fu
  2021-07-26 14:38                                                               ` Perry E. Metzger
  2021-07-24 16:14                                                             ` Eli Zaretskii
  1 sibling, 2 replies; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-24 15:48 UTC (permalink / raw)
  To: Yuan Fu; +Cc: cpitclaudel, monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Sat, 24 Jul 2021 11:04:35 -0400
> Cc: Stefan Monnier <monnier@iro.umontreal.ca>,
>  Clément Pit-Claudel <cpitclaudel@gmail.com>,
>  emacs-devel@gnu.org
> 
> I wrote a simple interface between font-lock and tree-sitter, and it works pretty well: using tree-sitter for fontification, xdisp.c opens a lot faster, and scrolling through the buffer is also perceivably faster. My simple interface works like this: tree-sitter allow you to “pattern match” nodes in the parse tree with a DSL, and assign names to the matched nodes, e.g., given a pattern, you get back a list of (NAME . MATCHED-NODE). And if we use font-lock faces as names for those nodes, we get back a list of (FACE . MATCHED-NODE) from tree-sitter, and Emacs can simply look at the beginning and end of the node, and apply FACE to that range. For flexibility, FACE can also be a function, in which case the function is called with the node. This interface is basically what emacs-tree-sitt
 er does (I don’t know if they allow a capture name to be a function, though.)
> 
> I have an example major-mode for C that uses tree-sitter for font-locking at the end of tree-sitter.el. 
> 
> Main functions to look at: tree-sitter-query-capture in tree_sitter.c, and tree-sitter-fontify-region-function in tree-sitter.el.

Thanks!

> BTW, what is the best way to signal a lisp error from C? I tried xsignal2, signal_error, error and friends but they seem to crash Emacs. Maybe I wasn’t using them correctly.

xsignal2 should work, as should xsignal.  Please show the code which
crashed.

> IIUC if we want tree-sitter to use our malloc, we need to build it with Emacs, where should I put the source of tree-sitter?

tree-sitter itself should be a library we link against.  If you meant
the tree-sitter support code, then it should go on a separate file in
src/.  Or did I misunderstand your question?

> What’s the different between make_string and make_pure_c_string? I’ve seen this “pure” thing else where, what does “pure” mean?

I suggest to read the node "Pure Storage" in the ELisp manual.  It
explains that.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-24 15:10                                             ` Stefan Monnier
@ 2021-07-24 15:51                                               ` Eli Zaretskii
  0 siblings, 0 replies; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-24 15:51 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: cpitclaudel, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: cpitclaudel@gmail.com,  emacs-devel@gnu.org
> Date: Sat, 24 Jul 2021 11:10:26 -0400
> 
> >> I see absolutely no problem with scaling in making a copy: the extra
> >> memory and CPU time taken by the copy will be a constant factor which
> >> I don't expect to go much beyond 10%
> > 10% of what?  It will be 100% of all the buffers that need parsing.
> 
> 10% of the memory used by that buffer, since TS's data structure eats up
> about 10x the size of the buffer's text.

That's still a lot of wasted storage, let alone time.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-24 15:04                                                           ` Yuan Fu
  2021-07-24 15:48                                                             ` Eli Zaretskii
@ 2021-07-24 16:14                                                             ` Eli Zaretskii
  2021-07-24 17:32                                                               ` Yuan Fu
  1 sibling, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-24 16:14 UTC (permalink / raw)
  To: Yuan Fu; +Cc: cpitclaudel, monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Sat, 24 Jul 2021 11:04:35 -0400
> Cc: Stefan Monnier <monnier@iro.umontreal.ca>,
>  Clément Pit-Claudel <cpitclaudel@gmail.com>,
>  emacs-devel@gnu.org
> 
> +(define-derived-mode ts-c-mode prog-mode "TS C"
> +  "C mode with tree-sitter support."
> +  (setq-local font-lock-fontify-region-function
> +              #'tree-sitter-fontify-region-function)
> +  (setq-local tree-sitter-font-lock-settings
> +              `(("font-lock-c"
> +                 ,(tree-sitter-c)
> +                 "(null) @font-lock-constant-face
> +(true) @font-lock-constant-face
> +(false) @font-lock-constant-face
> +
> +(comment) @font-lock-comment-face
> +
> +(system_lib_string) @ts-c-fontify-system-lib
> +
> +(unary_expression
> +  operator: _ @font-lock-negation-char-face)
> +
> +(string_literal) @font-lock-string-face
> +(char_literal) @font-lock-string-face

Where does this repertoire of possible syntax categories come from?
Is this from some list that TS exposes or documents?  If so, what
happens when the repertoire is modified?

>        beg = (char *) BUF_BYTE_ADDRESS (buffer, byte_pos);
> -      len = next_char_len(byte_pos);
> +      len = BYTES_BY_CHAR_HEAD ((int) beg);

The last line is wrong: you need the byte itself.  So it should be:

      len = BYTES_BY_CHAR_HEAD (*beg);



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-24 15:48                                                             ` Eli Zaretskii
@ 2021-07-24 17:14                                                               ` Yuan Fu
  2021-07-24 17:20                                                                 ` Eli Zaretskii
  2021-07-26 14:38                                                               ` Perry E. Metzger
  1 sibling, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-07-24 17:14 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: cpitclaudel, Stefan Monnier, emacs-devel

[-- Attachment #1: Type: text/plain, Size: 667 bytes --]

> 
>> IIUC if we want tree-sitter to use our malloc, we need to build it with Emacs, where should I put the source of tree-sitter?
> 
> tree-sitter itself should be a library we link against.  If you meant
> the tree-sitter support code, then it should go on a separate file in
> src/.  Or did I misunderstand your question?

If we link against libtree-sitter, how do we change its malloc behavior? Tree-sitter has these kind of things:

#ifndef ts_malloc
#define ts_malloc  ts_malloc_default
#endif

So I assume we need to define ts_malloc to, say, xmalloc when compiling libtree-sitter. And if we only link to it, we can’t redefine ts_malloc.

Yuan

[-- Attachment #2: Type: text/html, Size: 4452 bytes --]

^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-24 17:14                                                               ` Yuan Fu
@ 2021-07-24 17:20                                                                 ` Eli Zaretskii
  2021-07-24 17:40                                                                   ` Yuan Fu
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-24 17:20 UTC (permalink / raw)
  To: Yuan Fu; +Cc: cpitclaudel, monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Sat, 24 Jul 2021 13:14:50 -0400
> Cc: Stefan Monnier <monnier@iro.umontreal.ca>,
>  cpitclaudel@gmail.com,
>  emacs-devel@gnu.org
> 
>  IIUC if we want tree-sitter to use our malloc, we need to build it with Emacs, where should I put the
>  source of tree-sitter?
> 
>  tree-sitter itself should be a library we link against.  If you meant
>  the tree-sitter support code, then it should go on a separate file in
>  src/.  Or did I misunderstand your question?
> 
> If we link against libtree-sitter, how do we change its malloc behavior? Tree-sitter has these kind of things:
> 
> #ifndef ts_malloc
> #define ts_malloc  ts_malloc_default
> #endif
> 
> So I assume we need to define ts_malloc to, say, xmalloc when compiling libtree-sitter. And if we only link to
> it, we can’t redefine ts_malloc.

How does TS propose the client projects to do that?  Are you saying
that the only way to replace its malloc is to recompile tree-sitter??



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-24 16:14                                                             ` Eli Zaretskii
@ 2021-07-24 17:32                                                               ` Yuan Fu
  2021-07-24 17:42                                                                 ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-07-24 17:32 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: cpitclaudel, monnier, emacs-devel



> On Jul 24, 2021, at 12:14 PM, Eli Zaretskii <eliz@gnu.org> wrote:
> 
>> From: Yuan Fu <casouri@gmail.com>
>> Date: Sat, 24 Jul 2021 11:04:35 -0400
>> Cc: Stefan Monnier <monnier@iro.umontreal.ca>,
>> Clément Pit-Claudel <cpitclaudel@gmail.com>,
>> emacs-devel@gnu.org
>> 
>> +(define-derived-mode ts-c-mode prog-mode "TS C"
>> +  "C mode with tree-sitter support."
>> +  (setq-local font-lock-fontify-region-function
>> +              #'tree-sitter-fontify-region-function)
>> +  (setq-local tree-sitter-font-lock-settings
>> +              `(("font-lock-c"
>> +                 ,(tree-sitter-c)
>> +                 "(null) @font-lock-constant-face
>> +(true) @font-lock-constant-face
>> +(false) @font-lock-constant-face
>> +
>> +(comment) @font-lock-comment-face
>> +
>> +(system_lib_string) @ts-c-fontify-system-lib
>> +
>> +(unary_expression
>> +  operator: _ @font-lock-negation-char-face)
>> +
>> +(string_literal) @font-lock-string-face
>> +(char_literal) @font-lock-string-face
> 
> Where does this repertoire of possible syntax categories come from?
> Is this from some list that TS exposes or documents?  If so, what
> happens when the repertoire is modified?

These “syntax categories” are defined by individual language grammar definition for tree-sitter, so it could change from language to language. And tree-sitter does not document them. If these “syntax categories” change, then we need to change our code with them. But I doubt that it will happen often. They are hard to document, because a non-trivial grammar definition often defines hundreds of them; the grammar definition for C has 1000 LOC.

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-24 17:20                                                                 ` Eli Zaretskii
@ 2021-07-24 17:40                                                                   ` Yuan Fu
  2021-07-24 17:46                                                                     ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-07-24 17:40 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Clément Pit-Claudel, Stefan Monnier, emacs-devel

[-- Attachment #1: Type: text/plain, Size: 1840 bytes --]



> On Jul 24, 2021, at 1:20 PM, Eli Zaretskii <eliz@gnu.org> wrote:
> 
>> From: Yuan Fu <casouri@gmail.com <mailto:casouri@gmail.com>>
>> Date: Sat, 24 Jul 2021 13:14:50 -0400
>> Cc: Stefan Monnier <monnier@iro.umontreal.ca <mailto:monnier@iro.umontreal.ca>>,
>> cpitclaudel@gmail.com <mailto:cpitclaudel@gmail.com>,
>> emacs-devel@gnu.org <mailto:emacs-devel@gnu.org>
>> 
>> IIUC if we want tree-sitter to use our malloc, we need to build it with Emacs, where should I put the
>> source of tree-sitter?
>> 
>> tree-sitter itself should be a library we link against.  If you meant
>> the tree-sitter support code, then it should go on a separate file in
>> src/.  Or did I misunderstand your question?
>> 
>> If we link against libtree-sitter, how do we change its malloc behavior? Tree-sitter has these kind of things:
>> 
>> #ifndef ts_malloc
>> #define ts_malloc  ts_malloc_default
>> #endif
>> 
>> So I assume we need to define ts_malloc to, say, xmalloc when compiling libtree-sitter. And if we only link to
>> it, we can’t redefine ts_malloc.
> 
> How does TS propose the client projects to do that?  Are you saying
> that the only way to replace its malloc is to recompile tree-sitter??

Here is the relevant lines in alloc.h in tree-sitter:

// Allow clients to override allocation functions

#ifndef ts_malloc
#define ts_malloc  ts_malloc_default
#endif
#ifndef ts_calloc
#define ts_calloc  ts_calloc_default
#endif
#ifndef ts_realloc
#define ts_realloc ts_realloc_default
#endif
#ifndef ts_free
#define ts_free    ts_free_default
#endif

I’m not a C expert, does this allow us to replace its malloc in runtime?

Relative discussion found on the issue tracker: https://github.com/tree-sitter/tree-sitter/issues/739 <https://github.com/tree-sitter/tree-sitter/issues/739>

Yuan

[-- Attachment #2: Type: text/html, Size: 4793 bytes --]

^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-24 17:32                                                               ` Yuan Fu
@ 2021-07-24 17:42                                                                 ` Eli Zaretskii
  0 siblings, 0 replies; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-24 17:42 UTC (permalink / raw)
  To: Yuan Fu; +Cc: cpitclaudel, monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Sat, 24 Jul 2021 13:32:18 -0400
> Cc: monnier@iro.umontreal.ca,
>  cpitclaudel@gmail.com,
>  emacs-devel@gnu.org
> 
> >> +(define-derived-mode ts-c-mode prog-mode "TS C"
> >> +  "C mode with tree-sitter support."
> >> +  (setq-local font-lock-fontify-region-function
> >> +              #'tree-sitter-fontify-region-function)
> >> +  (setq-local tree-sitter-font-lock-settings
> >> +              `(("font-lock-c"
> >> +                 ,(tree-sitter-c)
> >> +                 "(null) @font-lock-constant-face
> >> +(true) @font-lock-constant-face
> >> +(false) @font-lock-constant-face
> >> +
> >> +(comment) @font-lock-comment-face
> >> +
> >> +(system_lib_string) @ts-c-fontify-system-lib
> >> +
> >> +(unary_expression
> >> +  operator: _ @font-lock-negation-char-face)
> >> +
> >> +(string_literal) @font-lock-string-face
> >> +(char_literal) @font-lock-string-face
> > 
> > Where does this repertoire of possible syntax categories come from?
> > Is this from some list that TS exposes or documents?  If so, what
> > happens when the repertoire is modified?
> 
> These “syntax categories” are defined by individual language grammar definition for tree-sitter, so it could change from language to language. And tree-sitter does not document them. If these “syntax categories” change, then we need to change our code with them. But I doubt that it will happen often. They are hard to document, because a non-trivial grammar definition often defines hundreds of them; the grammar definition for C has 1000 LOC.

Isn't there a better way of updating those than manually take them out
of the TS grammar?  Maybe write a short program linked against TS that
would spill them in some format that's convenient to use?  Manual
updates are a serious maintenance burden.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-24 17:40                                                                   ` Yuan Fu
@ 2021-07-24 17:46                                                                     ` Eli Zaretskii
  2021-07-24 18:06                                                                       ` Yuan Fu
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-24 17:46 UTC (permalink / raw)
  To: Yuan Fu; +Cc: cpitclaudel, monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Sat, 24 Jul 2021 13:40:28 -0400
> Cc: Stefan Monnier <monnier@iro.umontreal.ca>,
>  Clément Pit-Claudel <cpitclaudel@gmail.com>,
>  emacs-devel@gnu.org
> 
>  How does TS propose the client projects to do that?  Are you saying
>  that the only way to replace its malloc is to recompile tree-sitter??
> 
> Here is the relevant lines in alloc.h in tree-sitter:
> 
> // Allow clients to override allocation functions
> 
> #ifndef ts_malloc
> #define ts_malloc  ts_malloc_default
> #endif
> #ifndef ts_calloc
> #define ts_calloc  ts_calloc_default
> #endif
> #ifndef ts_realloc
> #define ts_realloc ts_realloc_default
> #endif
> #ifndef ts_free
> #define ts_free    ts_free_default
> #endif
> 
> I’m not a C expert, does this allow us to replace its malloc in runtime?

No, not AFAIU.  It only allows to make such changes when TS is
compiled.

We should ask the TS developers to provide a way of specifying custom
memory allocation/release function as part of TS initialization.  It
is a feature many packages provide.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-24 17:46                                                                     ` Eli Zaretskii
@ 2021-07-24 18:06                                                                       ` Yuan Fu
  2021-07-24 18:21                                                                         ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-07-24 18:06 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: cpitclaudel, Stefan Monnier, emacs-devel


> We should ask the TS developers to provide a way of specifying custom
> memory allocation/release function as part of TS initialization.  It
> is a feature many packages provide.

I commented on tree-sitter’s 1.0 checklist.

> Isn't there a better way of updating those than manually take them out
> of the TS grammar?  Maybe write a short program linked against TS that
> would spill them in some format that's convenient to use?  Manual
> updates are a serious maintenance burden.

How does this convenient format looks like, in your mind? The grammar definition is already the “source”, I don’t see a way to magically make it easier to work with. What does “manual updates” refer to? If you mean updating patterns like

(init_declarator
 declarator: (identifier) @font-lock-variable-name-face)

(parameter_declaration
 declarator: (identifier) @font-lock-variable-name-face)

when a language’s grammar changes, I don’t think we need to update them often, or ever. And It is not harder than updating font-lock-keywords when a language adds a new fancy syntax.

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-24 18:06                                                                       ` Yuan Fu
@ 2021-07-24 18:21                                                                         ` Eli Zaretskii
  2021-07-24 18:55                                                                           ` Stefan Monnier
  2021-07-25 18:44                                                                           ` Stephen Leake
  0 siblings, 2 replies; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-24 18:21 UTC (permalink / raw)
  To: Yuan Fu; +Cc: cpitclaudel, monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Sat, 24 Jul 2021 14:06:52 -0400
> Cc: Stefan Monnier <monnier@iro.umontreal.ca>,
>  cpitclaudel@gmail.com,
>  emacs-devel@gnu.org
> 
> > We should ask the TS developers to provide a way of specifying custom
> > memory allocation/release function as part of TS initialization.  It
> > is a feature many packages provide.
> 
> I commented on tree-sitter’s 1.0 checklist.

Thanks.

> > Isn't there a better way of updating those than manually take them out
> > of the TS grammar?  Maybe write a short program linked against TS that
> > would spill them in some format that's convenient to use?  Manual
> > updates are a serious maintenance burden.
> 
> How does this convenient format looks like, in your mind? The grammar definition is already the “source”, I don’t see a way to magically make it easier to work with. What does “manual updates” refer to? If you mean updating patterns like
> 
> (init_declarator
>  declarator: (identifier) @font-lock-variable-name-face)
> 
> (parameter_declaration
>  declarator: (identifier) @font-lock-variable-name-face)
> 
> when a language’s grammar changes, I don’t think we need to update them often, or ever. And It is not harder than updating font-lock-keywords when a language adds a new fancy syntax.

It isn't an immediate problem, so we can delay it for later.

However, I do worry about the ability to update this in some
non-manual way.  Take for example the way we update our character
databases when Unicode adds more characters/scripts: we use the data
files distributed by the Unicode Consortium and process them with
scripts in admin/unidata to produce intermediate files in a format
convenient for processing by Emacs, then we process those intermediate
files as part of building Emacs.  Unicode files change maybe or twice
a year, but still, doing all those changes manually would be a burden.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-24 18:21                                                                         ` Eli Zaretskii
@ 2021-07-24 18:55                                                                           ` Stefan Monnier
  2021-07-25 18:44                                                                           ` Stephen Leake
  1 sibling, 0 replies; 284+ messages in thread
From: Stefan Monnier @ 2021-07-24 18:55 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Yuan Fu, cpitclaudel, emacs-devel

> However, I do worry about the ability to update this in some
> non-manual way.

It has to be manual to the extent that it's not something that is
inherent to the BNF grammar.  The rules could accompany the grammar
(and TS could give access to them), in which case presumably all editors
using TS would end up fontifying in the same way, which would
make a fair bit of sense.

But in any case, this seems to be a preoccupation that goes much beyond
the actual immediate integration of tree-sitter into Emacs, and concerns
instead the evolution of tree-sitter itself.


        Stefan




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-24  9:33                                               ` Stephen Leake
@ 2021-07-24 22:54                                                 ` Dmitry Gutov
  0 siblings, 0 replies; 284+ messages in thread
From: Dmitry Gutov @ 2021-07-24 22:54 UTC (permalink / raw)
  To: Stephen Leake, Yuan Fu
  Cc: Eli Zaretskii, Clément Pit-Claudel, Stefan Monnier, emacs-devel

On 24.07.2021 12:33, Stephen Leake wrote:
>> Plus, if tree-sitter respects narrowing, it could happen where a user
>> narrows the buffer, the font-locking changes and is not correct
>> anymore. Maybe that’s not the user want.
> Exactly. The indent will be wrong, too, if narrowing excludes a
> containing block.

The important pieces of code now (in recent Emacs versions) undo 
narrowing when do fundamental operations like parsing the buffer (with 
syntax-spss), applying font-lock rules or doing indentation, unless 
instructed otherwise by the major mode, or the multiple-major-mode 
framework.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-24  6:51                                 ` Eli Zaretskii
@ 2021-07-25 16:16                                   ` Stephen Leake
  0 siblings, 0 replies; 284+ messages in thread
From: Stephen Leake @ 2021-07-25 16:16 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: casouri, monnier, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Stephen Leake <stephen_leake@stephe-leake.org>
>> Cc: casouri@gmail.com,  monnier@iro.umontreal.ca,  emacs-devel@gnu.org
>> Date: Fri, 23 Jul 2021 19:00:12 -0700
>> 
>> Eli Zaretskii <eliz@gnu.org> writes:
>> 
>> >> > I fail to see the significance of the difference.  Surely, you could
>> >> > hand it a block of text with changes to mean that this block replaces
>> >> > the previous version of that block.  It might take the parser more
>> >> > work to update the parse tree in this case, but if it's fast enough,
>> >> > that won't be the problem.  Right?
>> >> 
>> >> tree-sitter doesn't store the previous text, so there's nothing to
>> >> compare it to.
>> >
>> > There was nothing about comparison in my text.  You tell TS that
>> > editing replaced a block of text between A and B with block between A
>> > and C, without revealing the fine-grained changes inside that block.
>> > This must work, because editing could indeed do just that.
>> 
>> I see; treat the whole block as one change. Yes, that would work, but it
>> would probably be less optimal than sending a list of smaller changes;
>> depends on the details.
>
> Since TS is very fast, I think this sub-optimality will not cause any
> tangible performance issues in Emacs.  And from our POV it is a good
> optimization because it will minimize (and to some extent optimize)
> the traffic between Emacs and TS.

"optimal" refers to more than speed; error recovery is also important.
The more of the previous tree you keep, the better the error recovery.

After we get some good metrics/benchmarks for actual Emacs use (ie, how
good is the indentation?), we can explore this.

-- 
-- Stephe



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-24  3:39                                                 ` Óscar Fuentes
  2021-07-24  7:34                                                   ` Eli Zaretskii
@ 2021-07-25 16:49                                                   ` Stephen Leake
  1 sibling, 0 replies; 284+ messages in thread
From: Stephen Leake @ 2021-07-25 16:49 UTC (permalink / raw)
  To: Óscar Fuentes; +Cc: emacs-devel

Óscar Fuentes <ofv@wanadoo.es> writes:

> Stephen Leake <stephen_leake@stephe-leake.org> writes:
>
>> 39 milliseconds for re-indent is just slow enough to be noticeable; I still
>> have algorithms to convert to be as incremental as possible.
>
> [snip]
>
>> In a very small file:
>>
>> initial   0.000632 seconds
>> re-indent 0.000942 seconds
>>
>> Easily fast enough to keep up with the user.
>
> Doing work every time the user changes the file is not always a good
> thing. 

It very much depends on the user's preferences. Note that in standard
Emacs usage, font-lock runs after every character is typed.

With the current ada-mode release, which uses partial parse instead of
incremental parse, the parse process cannot keep up with user typing. So
I run with jit-lock-defer-time set to 1.5 seconds. However, many people
want the fontification to be much more responsive. With wisi
incremental parse, ada-mode can now do that.

Since I got incremental parse working in wisi, I've set
jit-lock-defer-time to the default nil; I like it, and will not go back.
I mostly tolerated the delay before because I knew how hard it would be
to fix :).

> Nowadays the user doesn't just expect automatic indentation, he wants
> code formatting too, which means splitting, fusing and inserting
> lines, plus moving chunks of code left and right. Doing that every
> time a character is added or deleted can be visually confusing due to
> chunks of text changing positions as you type, so the systems I know
> are triggered by certain events (like the insertion of characters that
> mark the end of statements). 

Yes; different parser-based operations are triggered by different
events. That is true for wisi now; font-lock is triggered by the
standard Emacs mechanisms (ie, after every character is typed, the
window is scrolled, etc), indent is triggered by the standard Emacs
mechanisms (indent-region-function, indent-line-function; ie RET and
TAB), navigate (computing single-file cross-reference) is triggered by
forward-sexp or some similar "wisi-goto-*" function, reformatting is
triggered by "align" (in parallel with the standard Emacs align
mechanism) or a direct wisi-reformat-* function (there are some in a
context menu for Ada). All of these operations update the parse tree
only if the buffer has changed; if not, they use the existing tree.

The user can always customize things - wisi provides the framework.

> Then they analyze the code and, if it is well formed, apply the
> reformatting. Something similar could be said about fontification and
> other tasks.

Wisi does indentation even in the presence of syntax errors (ie, not
"well formed"). This helps when writing code; when entering an "if"
statement, you don't have to start with a complete template; you just
type the code. It does sometimes cause confusing results; fixing the
syntax always resolves that.

> So I'll insist on not obsessing too much about performance. Implement
> something simple, see if it is usable. If not, invest effort on
> optimizations until it is good enough.

Yes; premature optimization is the enemy of good enough. And good
benchmarks/metrics should be the guide of any optimization; allowing
font-lock to run after every character is typed is one such metric.
Indentation in the presence of syntax errors is another; that was the
primary complaint about ada-mode before I implemented error correction,
and is still a common complaint; incremental parse will improve that.
These two metrics were the trigger that started me implementing
incremental parse.

-- 
-- Stephe



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-24  7:06                                                 ` Eli Zaretskii
@ 2021-07-25 17:48                                                   ` Stephen Leake
  0 siblings, 0 replies; 284+ messages in thread
From: Stephen Leake @ 2021-07-25 17:48 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: cpitclaudel, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Stephen Leake <stephen_leake@stephe-leake.org>
>> Cc: Clément Pit-Claudel <cpitclaudel@gmail.com>,
>>   emacs-devel@gnu.org
>> Date: Fri, 23 Jul 2021 19:57:32 -0700
>> 
>> > How much "less"?  Close to 1 sec is indeed annoying, but 20 msec or so
>> > should be bearable.
>> >
>> > You seem to assume up front that TS (re)-parsing will take 1 sec, but
>> > AFAIK there's no reason to assume such bad performance.
>> 
>> This is for the initial parse, on a large file. No matter how fast the
>> parser is, I can give you a file that takes one second to parse, and
>> some user will have such a file (the work always expands to consume all
>> the resources available).
>
> That problem is already with us: if I visit xdisp.c in an unoptimized
> build of Emacs 28, I wait almost 4 sec for the first window-full to be
> displayed.  (It's more like 0.5 sec in an optimized build of Emacs
> 27.2.)  So the real question is how much using TS will _improve_ the
> situation.

Yes. But here other solutions, like parsing only part of the buffer,
offer much better improvement.

>> I just got incremental parse working well enough to measure it; in the
>> largest Ada file I have (10,000 lines from Eurocontrol):
>> 
>> initial parse:       1.539319 seconds
>> re-indent two lines: 0.038999 seconds
>> 
>> 39 milliseconds for re-indent is just slow enough to be noticeable; I still
>> have algorithms to convert to be as incremental as possible.
>
> For comparison, how much does re-indentation of 2 lines take in Emacs
> without a parser?

I don't think this is a meaningful question, or at least, I don't have
an answer.

For ada-mode, you'd have to go back to version 4.0, where the
indentation was ad-hoc elisp. It was fast enough to be not noticeable.
But I switched to a parser because that indentation algorithm was often
incorrect, and was very brittle in the face of new features in new Ada
language standard releases.

Other languages don't use a parser for indentation, so there's no way to
compare. Even the AdaCore editor Gnat Studio doesn't use their parser
for indentation in Ada; Emacs ada-mode is the only one I know of.

I guess you could say it's a trade of indentation quality vs speed.
Witness the recent thread about inconsistent fontification in C; a
parser would resolve that, but LSP via eglot is probably slower than the
current elisp. Indentation is similar, but the quality difference is
bigger, at least for Ada.

> 39 msec might be noticeable, but it isn't annoying; anything below 50
> msec isn't.  

You are right; in that large Ada file, I don't notice the font-lock
delay after typing each character.

> Try "C-x TAB" in Emacs on 10-line block of text, and you get more than
> that.

Depends on the mode;

text-mode: 0.4 microseconds.

In xdisp.c, indenting it_char_has_category, 47.5 milliseconds.

In benchmark.el, indenting benchmark-call; 1.2 milliseconds.

The computation here is font-lock due to the text moving in the buffer;
in the ada-mode benchmark above, it is computing indent. Calling
indent-rigidly, then indent-region (which results in zero net buffer
change, so apparently no significant font-lock), I get:

xdisp.c: 17.1 ms

benchmark.el: 3.6 ms

-- 
-- Stephe



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-24  6:00                                                           ` Eli Zaretskii
@ 2021-07-25 18:01                                                             ` Stephen Leake
  2021-07-25 19:09                                                               ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Stephen Leake @ 2021-07-25 18:01 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Yuan Fu, emacs-devel, cpitclaudel, monnier

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Yuan Fu <casouri@gmail.com>
>> Date: Fri, 23 Jul 2021 16:22:59 -0400
>> Cc: Stefan Monnier <monnier@iro.umontreal.ca>,
>>  Clément Pit-Claudel <cpitclaudel@gmail.com>,
>>  emacs-devel@gnu.org
>> 
>> > We must replace this function, if only because the MS-Windows build of
>> > Emacs uses a custom malloc implementation.  Does TS allow the client
>> > to use its own malloc?
>> 
>> Yes, in that case, we need to embed tree-sitter into Emacs, instead
>> of using it as a dynamic library, I think.
>> 
>> // Allow clients to override allocation functions
>> #ifndef ts_malloc
>> #define ts_malloc  ts_malloc_default
>> #endif
>> #ifndef ts_calloc
>> #define ts_calloc  ts_calloc_default
>> #endif
>> #ifndef ts_realloc
>> #define ts_realloc ts_realloc_default
>> #endif
>> #ifndef ts_free
>> #define ts_free    ts_free_default
>> #endif
>> 
>> How do we handle such thing in Emacs?
>
> We use xmalloc, which calls memory_full when allocation fails, which
> releases some spare memory we have for this purpose, and tells the
> user to save the session and exit.

I'm thinking about how this applies to wisi, when migrating to a module.

Ada has a built-in allocator; it's probably possible to change that, but
I'd like to understand exactly why we need to do that.

The Ada allocator throws an exception on allocation fail; is it
sufficient to turn that exception into an elisp signal, and arrange for
elisp to call memory_full (or take some other action, like killing the
parser)?

Another possible reason to change the Ada allocator is if we want to
expose Ada memory pointers directly to elisp, as Yuan Fu wants to do for
tree-sitter (I don't plan to do this for wisi). Does that require that
the pointers be allocated by the same allocator? I'm not clear what that
would mean for the garbage collector; is it then expected to recover the
tree-sitter-allocated memory for the tree? or does it ignore those lisp
objects?

-- 
-- Stephe



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-24 11:22                                                   ` Eli Zaretskii
@ 2021-07-25 18:21                                                     ` Stephen Leake
  2021-07-25 19:03                                                       ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Stephen Leake @ 2021-07-25 18:21 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: casouri, emacs-devel, cpitclaudel, monnier

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Stephen Leake <stephen_leake@stephe-leake.org>
>> Cc: Yuan Fu <casouri@gmail.com>,  cpitclaudel@gmail.com,
>>   monnier@iro.umontreal.ca,  emacs-devel@gnu.org
>> Date: Sat, 24 Jul 2021 02:42:24 -0700
>> 
>> > But that's how the current font-lock and indentation work: they never
>> > look beyond the narrowing limits.  
>> 
>> And that's broken
>
> ??? Of course, it isn't: it's how Emacs has worked since v21.1.

Ada (and other languages, but not all) requires the full file text to
properly compute font and indent; narrowing breaks that.

The fix for font-lock is font-lock-dont-widen; I implemented a similar
mechanism for indent of an ada-mode region in multi-major-mode.

In plain ada-mode, indent is currently broken in a narrowed buffer;
wisi-indent-region does not widen because it is language-agnostic, and I
have not gotten around to implementing a "widen for indent" hook because I
don't use narrowing very often, and no one has complained.

>> unless the narrowing is for multi-major-mode.
>
> And what would you do in that case, if you allow TS to look beyond the
> restriction?

In the multi-major-mode case, there is a separate parser for each
language, and each sub-mode region in the text would get its own parser
tree (ie, it acts like a separate file), and that parser tree is only
told about changes to those regions. So the parser will never try to
look outside the region; it doesn't need to know about narrowing.

I'll have to upgrade my Ada multi-major-mode implementation to do this
for incremental parse.

-- 
-- Stephe



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-24 18:21                                                                         ` Eli Zaretskii
  2021-07-24 18:55                                                                           ` Stefan Monnier
@ 2021-07-25 18:44                                                                           ` Stephen Leake
  1 sibling, 0 replies; 284+ messages in thread
From: Stephen Leake @ 2021-07-25 18:44 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Yuan Fu, emacs-devel, cpitclaudel, monnier

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Yuan Fu <casouri@gmail.com>
>>
>> when a language’s grammar changes, I don’t think we need to update
>> them often, or ever. And It is not harder than updating
>> font-lock-keywords when a language adds a new fancy syntax.
>
> It isn't an immediate problem, so we can delay it for later.
>
> However, I do worry about the ability to update this in some
> non-manual way.  Take for example the way we update our character
> databases when Unicode adds more characters/scripts: we use the data
> files distributed by the Unicode Consortium

If the language concerned has some standard definition that is machine
readable, then we could get partway there.

But no language standard specifies fontification (or indent), so there
is no standard machine-readable description of these.

tree-sitter provides a defacto standard for specifying fontification (in
the highlight rules files); it would make sense for emacs to be able to
read those files directly, along with linking to the corresponding
tree-sitter parser. There would have to provide a separate mapping from
tree-sitter notation to emacs font names. 

wisi provides a mechanism to describe fontification and indentation in
the grammar source file; every time ISO releases a new Ada language
version, I have to compare them and incorporate the changes. I've
written code to partly automate this (the language reference manual
contains the grammar in a variant of EBNF in an appendix, which is
mostly machine-readable), but it's highly Ada specific, and still mostly
a manual process. Fortunately it only happens every 10 years for Ada :).

Many languages provide some EBNF description of the language, but it is
often in a form that is not suitable for whatever parser generator you
are using; it is usually optimized for human understanding. I made it a
requirement for the wisi parser generator to use the Ada reference
grammar as closely as possible, but I still have to modify the grammar
to get reasonable performance. You can find many different Java grammars
on the web, optimized for different parser generators (there are two
different but nominally equivalent grammars in the Java docs).

--
-- Stephe



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-25 18:21                                                     ` Stephen Leake
@ 2021-07-25 19:03                                                       ` Eli Zaretskii
  2021-07-26 16:40                                                         ` Yuan Fu
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-25 19:03 UTC (permalink / raw)
  To: Stephen Leake; +Cc: casouri, emacs-devel, cpitclaudel, monnier

> From: Stephen Leake <stephen_leake@stephe-leake.org>
> Cc: casouri@gmail.com,  cpitclaudel@gmail.com,  monnier@iro.umontreal.ca,
>   emacs-devel@gnu.org
> Date: Sun, 25 Jul 2021 11:21:27 -0700
> 
> >> > But that's how the current font-lock and indentation work: they never
> >> > look beyond the narrowing limits.  
> >> 
> >> And that's broken
> >
> > ??? Of course, it isn't: it's how Emacs has worked since v21.1.
> 
> Ada (and other languages, but not all) requires the full file text to
> properly compute font and indent; narrowing breaks that.

Not relevant: if a major mode's fontification code knows it needs to
do that, it will call 'widen'.

The issue was what should the TS reader function do.  My firm opinion
is that it should not look beyond the restriction, because it isn't
its business to make those decisions.  If the caller needs to widen,
it will.

> >> unless the narrowing is for multi-major-mode.
> >
> > And what would you do in that case, if you allow TS to look beyond the
> > restriction?
> 
> In the multi-major-mode case, there is a separate parser for each
> language, and each sub-mode region in the text would get its own parser
> tree (ie, it acts like a separate file), and that parser tree is only
> told about changes to those regions. So the parser will never try to
> look outside the region; it doesn't need to know about narrowing.

Once again, we are talking about the function used by TS to read
buffer text.  Not about the parser or its caller.  Low-level code,
which knows nothing about the context, should never look beyond the
restriction.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-25 18:01                                                             ` Stephen Leake
@ 2021-07-25 19:09                                                               ` Eli Zaretskii
  2021-07-26  5:10                                                                 ` Stephen Leake
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-25 19:09 UTC (permalink / raw)
  To: Stephen Leake; +Cc: casouri, emacs-devel, cpitclaudel, monnier

> From: Stephen Leake <stephen_leake@stephe-leake.org>
> Cc: Yuan Fu <casouri@gmail.com>,  cpitclaudel@gmail.com,
>   monnier@iro.umontreal.ca,  emacs-devel@gnu.org
> Date: Sun, 25 Jul 2021 11:01:22 -0700
> 
> >> How do we handle such thing in Emacs?
> >
> > We use xmalloc, which calls memory_full when allocation fails, which
> > releases some spare memory we have for this purpose, and tells the
> > user to save the session and exit.
> 
> I'm thinking about how this applies to wisi, when migrating to a module.
> 
> Ada has a built-in allocator; it's probably possible to change that, but
> I'd like to understand exactly why we need to do that.

We need that to allow the user to save the session while he/she can.

> The Ada allocator throws an exception on allocation fail; is it
> sufficient to turn that exception into an elisp signal, and arrange for
> elisp to call memory_full (or take some other action, like killing the
> parser)?

What is a "lisp signal" in this context?

> Another possible reason to change the Ada allocator is if we want to
> expose Ada memory pointers directly to elisp, as Yuan Fu wants to do for
> tree-sitter (I don't plan to do this for wisi). Does that require that
> the pointers be allocated by the same allocator?

Same allocator as what?

> I'm not clear what that would mean for the garbage collector; is it
> then expected to recover the tree-sitter-allocated memory for the
> tree? or does it ignore those lisp objects?

It depends on which Lisp object you wrap those pointers.  User-pointer
object allow you to provide your own "finalizer" function.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-25 19:09                                                               ` Eli Zaretskii
@ 2021-07-26  5:10                                                                 ` Stephen Leake
  2021-07-26 12:56                                                                   ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Stephen Leake @ 2021-07-26  5:10 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: casouri, emacs-devel, cpitclaudel, monnier

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Stephen Leake <stephen_leake@stephe-leake.org>
>> Cc: Yuan Fu <casouri@gmail.com>,  cpitclaudel@gmail.com,
>>   monnier@iro.umontreal.ca,  emacs-devel@gnu.org
>> Date: Sun, 25 Jul 2021 11:01:22 -0700
>> 
>> >> How do we handle such thing in Emacs?
>> >
>> > We use xmalloc, which calls memory_full when allocation fails, which
>> > releases some spare memory we have for this purpose, and tells the
>> > user to save the session and exit.
>> 
>> I'm thinking about how this applies to wisi, when migrating to a module.
>> 
>> Ada has a built-in allocator; it's probably possible to change that, but
>> I'd like to understand exactly why we need to do that.
>
> We need that to allow the user to save the session while he/she can.
>
>> The Ada allocator throws an exception on allocation fail; is it
>> sufficient to turn that exception into an elisp signal, and arrange for
>> elisp to call memory_full (or take some other action, like killing the
>> parser)?
>
> What is a "lisp signal" in this context?

The module interface layer of wisi.el would do:

    (signal 'error "parser ran out of memory")

>> Another possible reason to change the Ada allocator is if we want to
>> expose Ada memory pointers directly to elisp, as Yuan Fu wants to do for
>> tree-sitter (I don't plan to do this for wisi). Does that require that
>> the pointers be allocated by the same allocator?
>
> Same allocator as what?

As other lisp symbols.

>> I'm not clear what that would mean for the garbage collector; is it
>> then expected to recover the tree-sitter-allocated memory for the
>> tree? or does it ignore those lisp objects?
>
> It depends on which Lisp object you wrap those pointers.  User-pointer
> object allow you to provide your own "finalizer" function.

Ok, that would work.

-- 
-- Stephe



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-26  5:10                                                                 ` Stephen Leake
@ 2021-07-26 12:56                                                                   ` Eli Zaretskii
  0 siblings, 0 replies; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-26 12:56 UTC (permalink / raw)
  To: Stephen Leake; +Cc: casouri, emacs-devel, cpitclaudel, monnier

> From: Stephen Leake <stephen_leake@stephe-leake.org>
> Cc: casouri@gmail.com,  cpitclaudel@gmail.com,  monnier@iro.umontreal.ca,
>   emacs-devel@gnu.org
> Date: Sun, 25 Jul 2021 22:10:12 -0700
> 
> >> The Ada allocator throws an exception on allocation fail; is it
> >> sufficient to turn that exception into an elisp signal, and arrange for
> >> elisp to call memory_full (or take some other action, like killing the
> >> parser)?
> >
> > What is a "lisp signal" in this context?
> 
> The module interface layer of wisi.el would do:
> 
>     (signal 'error "parser ran out of memory")

We don't have such an error (and handling an error when you've run out
of memory could backfire).

> >> Another possible reason to change the Ada allocator is if we want to
> >> expose Ada memory pointers directly to elisp, as Yuan Fu wants to do for
> >> tree-sitter (I don't plan to do this for wisi). Does that require that
> >> the pointers be allocated by the same allocator?
> >
> > Same allocator as what?
> 
> As other lisp symbols.

Not sure, perhaps you could free them in a finalizer instead.  If you
want GC to free them, then yes.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-24 15:48                                                             ` Eli Zaretskii
  2021-07-24 17:14                                                               ` Yuan Fu
@ 2021-07-26 14:38                                                               ` Perry E. Metzger
  1 sibling, 0 replies; 284+ messages in thread
From: Perry E. Metzger @ 2021-07-26 14:38 UTC (permalink / raw)
  To: emacs-devel

On 7/24/21 11:48, Eli Zaretskii wrote:
>> From: Yuan Fu <casouri@gmail.com>
>>
>>
>> IIUC if we want tree-sitter to use our malloc, we need to build it with Emacs, where should I put the source of tree-sitter?
> tree-sitter itself should be a library we link against.  If you meant
> the tree-sitter support code, then it should go on a separate file in
> src/.  Or did I misunderstand your question?
>
I suspect that the authors' expectations are that enough things need to 
be tweaked that a given editor project like Emacs probably would want to 
recompile Tree Sitter for use in their system. I'm not 100% sure about 
that, but it seems to be what they're assuming.


Perry





^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-25 19:03                                                       ` Eli Zaretskii
@ 2021-07-26 16:40                                                         ` Yuan Fu
  2021-07-26 16:49                                                           ` Eli Zaretskii
  2021-07-26 23:40                                                           ` Ergus
  0 siblings, 2 replies; 284+ messages in thread
From: Yuan Fu @ 2021-07-26 16:40 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: Clément Pit-Claudel, Stephen Leake, Stefan Monnier, emacs-devel

> 
>>>> unless the narrowing is for multi-major-mode.
>>> 
>>> And what would you do in that case, if you allow TS to look beyond the
>>> restriction?
>> 
>> In the multi-major-mode case, there is a separate parser for each
>> language, and each sub-mode region in the text would get its own parser
>> tree (ie, it acts like a separate file), and that parser tree is only
>> told about changes to those regions. So the parser will never try to
>> look outside the region; it doesn't need to know about narrowing.
> 
> Once again, we are talking about the function used by TS to read
> buffer text.  Not about the parser or its caller.  Low-level code,
> which knows nothing about the context, should never look beyond the
> restriction.

It doesn’t harm for tree-sitter to see the rest of the buffer, it doesn’t modify anything, all it does it reading the text. OTOH, restricting tree-sitter to the bounds of narrows adds complexity for no benefit (as far as I can see). Maybe narrowing is the context that low level code should ignore, or at least tree-sitter should ignore. The only benefit that I can think of is “we firmly adhere to the ‘contract’ that no one can look beyond the narrowed region”, but is it a good contract? Is there really a contract in the first place? IMO, narrowing acts like masking tapes over the rest of the buffer, so that user edits like re-replace wouldn’t spill out. Demanding everything in Emacs to not have access to the rest of the buffer is dogmatic (in the sense that it is too rigid and is simply following the doctrine blindly). 

And about language definitions and font-locking, I just realized that tree-sitter language definitions provides highlighting patterns, and we only need to minimally modify them to use them for Emacs, so there aren’t much manual effort involved.

Also, anyone have thoughts on how should tree-sitter intergrate with font-lock beyond the current simple interface?

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-26 16:40                                                         ` Yuan Fu
@ 2021-07-26 16:49                                                           ` Eli Zaretskii
  2021-07-26 17:09                                                             ` Yuan Fu
  2021-07-26 18:32                                                             ` chad
  2021-07-26 23:40                                                           ` Ergus
  1 sibling, 2 replies; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-26 16:49 UTC (permalink / raw)
  To: Yuan Fu; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Mon, 26 Jul 2021 12:40:31 -0400
> Cc: Stephen Leake <stephen_leake@stephe-leake.org>,
>  Clément Pit-Claudel <cpitclaudel@gmail.com>,
>  Stefan Monnier <monnier@iro.umontreal.ca>,
>  emacs-devel@gnu.org
> 
> > Once again, we are talking about the function used by TS to read
> > buffer text.  Not about the parser or its caller.  Low-level code,
> > which knows nothing about the context, should never look beyond the
> > restriction.
> 
> It doesn’t harm for tree-sitter to see the rest of the buffer, it doesn’t modify anything, all it does it reading the text. OTOH, restricting tree-sitter to the bounds of narrows adds complexity for no benefit (as far as I can see).

Which complexity does it add?  You just compare with BEGV_BYTE instead
of BEG_BYTE etc.

If we let TS look where it wants, we will lose the ability to restrict
it to a certain part of the buffer text.  This is needed at least for
some specialized modes, and is generally desirable, as it gives Lisp
programs an easy way to impose such restrictions whenever they need.

> Maybe narrowing is the context that low level code should ignore

No other code in Emacs does, and for a good reason.

> The only benefit that I can think of is “we firmly adhere to the ‘contract’ that no one can look beyond the narrowed region”, but is it a good contract? Is there really a contract in the first place?

It served us very well until now, so yes, I think it's a good
contract.

> IMO, narrowing acts like masking tapes over the rest of the buffer, so that user edits like re-replace wouldn’t spill out. Demanding everything in Emacs to not have access to the rest of the buffer is dogmatic (in the sense that it is too rigid and is simply following the doctrine blindly). 

Again, this "dogma" is used and adhered everywhere else in Emacs by
such low-level code.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-26 16:49                                                           ` Eli Zaretskii
@ 2021-07-26 17:09                                                             ` Yuan Fu
  2021-07-26 18:55                                                               ` Eli Zaretskii
  2021-07-27  6:13                                                               ` Stephen Leake
  2021-07-26 18:32                                                             ` chad
  1 sibling, 2 replies; 284+ messages in thread
From: Yuan Fu @ 2021-07-26 17:09 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: cpitclaudel, Stephen Leake, monnier, emacs-devel

>> 
>>> Once again, we are talking about the function used by TS to read
>>> buffer text.  Not about the parser or its caller.  Low-level code,
>>> which knows nothing about the context, should never look beyond the
>>> restriction.
>> 
>> It doesn’t harm for tree-sitter to see the rest of the buffer, it doesn’t modify anything, all it does it reading the text. OTOH, restricting tree-sitter to the bounds of narrows adds complexity for no benefit (as far as I can see).
> 
> Which complexity does it add?  You just compare with BEGV_BYTE instead
> of BEG_BYTE etc.

We need to “delete” the hidden text and “re-insert” when we widen the buffer. I’ll try to make it a no-op as long as we remember to widen before calling tree-sitter to parse anything.

> 
> If we let TS look where it wants, we will lose the ability to restrict
> it to a certain part of the buffer text.  This is needed at least for
> some specialized modes, and is generally desirable, as it gives Lisp
> programs an easy way to impose such restrictions whenever they need.

Tree-sitter lets you set ranges for a parser to limit it self within, in order to support multi-language files.

> 
>> Maybe narrowing is the context that low level code should ignore
> 
> No other code in Emacs does, and for a good reason.
> 
>> The only benefit that I can think of is “we firmly adhere to the ‘contract’ that no one can look beyond the narrowed region”, but is it a good contract? Is there really a contract in the first place?
> 
> It served us very well until now, so yes, I think it's a good
> contract.
> 
>> IMO, narrowing acts like masking tapes over the rest of the buffer, so that user edits like re-replace wouldn’t spill out. Demanding everything in Emacs to not have access to the rest of the buffer is dogmatic (in the sense that it is too rigid and is simply following the doctrine blindly). 
> 
> Again, this "dogma" is used and adhered everywhere else in Emacs by
> such low-level code.

Ok. I trust you to know better than I do.

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-26 16:49                                                           ` Eli Zaretskii
  2021-07-26 17:09                                                             ` Yuan Fu
@ 2021-07-26 18:32                                                             ` chad
  2021-07-26 18:44                                                               ` Perry E. Metzger
  2021-07-26 19:09                                                               ` Eli Zaretskii
  1 sibling, 2 replies; 284+ messages in thread
From: chad @ 2021-07-26 18:32 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: Yuan Fu, EMACS development team, Stephen Leake,
	Clément Pit-Claudel, Stefan Monnier

[-- Attachment #1: Type: text/plain, Size: 2141 bytes --]

On Mon, Jul 26, 2021 at 9:49 AM Eli Zaretskii <eliz@gnu.org> wrote:

> > It doesn’t harm for tree-sitter to see the rest of the buffer, it
> doesn’t modify anything, all it does it reading the text. OTOH, restricting
> tree-sitter to the bounds of narrows adds complexity for no benefit (as far
> as I can see).
>
> Which complexity does it add?  You just compare with BEGV_BYTE instead
> of BEG_BYTE etc.
>

In order to exercise many of its over-all benefits, tree-sitter builds and
maintains a parse tree of the whole file, and takes care to modify that
tree in-place with minimized changes. If emacs+ts were to remove everything
outside the narrow and then re-add it each time something temporarily
narrowed a file (say, to enhance mental focus), then emacs+ts would be
(wastefully) throwing away some of the underlying assumptions that makes ts
useful.

Emacs' internals use narrow/widen *and mostly honor them at most levels*
because they are emacs' abstraction for separating parts of a buffer from
other parts. Tree-sitter has a separate abstraction for doing this -- the
developer can have ts use different internal objects for different parts of
the file. This allows editors like Atom (a major influence on ts' original
feature set) to support what emacs would call multiple major modes. (Emacs
could still use some help here, c.f. Alan's proposal for "islands" from a
few years back.)

Using the ts multiple parsers support inside emacs+ts to "handle" narrowing
seems like a strong idea, but there are likely some complexities involved
in "switching" back and forth between the full-file parse and the narrowed
parse, plus making sure that the right parses are updated when the buffer
changes. With that in mind, it might be easier to start with an emacs+ts
prototype that always uses the full-file parse, and then adding the
"sub-parses" later. In that sense, it seems like it's primarily a matter of
what level of itch people want to start scratching when.

 emacs-dev islands discussion:
https://lists.gnu.org/archive/html/emacs-devel/2016-04/msg00585.html

Hope that helps,
~Chad

[-- Attachment #2: Type: text/html, Size: 2687 bytes --]

^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-26 18:32                                                             ` chad
@ 2021-07-26 18:44                                                               ` Perry E. Metzger
  2021-07-26 19:13                                                                 ` Eli Zaretskii
  2021-07-26 19:09                                                               ` Eli Zaretskii
  1 sibling, 1 reply; 284+ messages in thread
From: Perry E. Metzger @ 2021-07-26 18:44 UTC (permalink / raw)
  To: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 2358 bytes --]

On 7/26/21 14:32, chad wrote:
>
> In order to exercise many of its over-all benefits, tree-sitter builds 
> and maintains a parse tree of the whole file, and takes care to modify 
> that tree in-place with minimized changes. If emacs+ts were to remove 
> everything outside the narrow and then re-add it each time something 
> temporarily narrowed a file (say, to enhance mental focus), then 
> emacs+ts would be (wastefully) throwing away some of the 
> underlying assumptions that makes ts useful.
>
> Emacs' internals use narrow/widen *and mostly honor them at most 
> levels* because they are emacs' abstraction for separating parts of a 
> buffer from other parts. Tree-sitter has a separate abstraction for 
> doing this -- the developer can have ts use different internal objects 
> for different parts of the file. This allows editors like Atom (a 
> major influence on ts' original feature set) to support what emacs 
> would call multiple major modes. (Emacs could still use some help 
> here, c.f. Alan's proposal for "islands" from a few years back.)
>
> Using the ts multiple parsers support inside emacs+ts to "handle" 
> narrowing seems like a strong idea, but there are likely some 
> complexities involved in "switching" back and forth between the 
> full-file parse and the narrowed parse, plus making sure that the 
> right parses are updated when the buffer changes. With that in mind, 
> it might be easier to start with an emacs+ts prototype that always 
> uses the full-file parse, and then adding the "sub-parses" later. In 
> that sense, it seems like it's primarily a matter of what level of 
> itch people want to start scratching when.
>
>  emacs-dev islands discussion: 
> https://lists.gnu.org/archive/html/emacs-devel/2016-04/msg00585.html
>
I strongly agree with what's said here. I'll also note that some 
languages will not parse correctly if you narrow to (say) only a block 
or part of a function and not a whole file, and it would be unexpected 
by users to narrow to a part of their code and suddenly have things like 
tree sitter go haywire. If I narrow to a part of a function, I'd like my 
indentation and highlighting to remain correct.

My suggestion is that we at least experiment with allowing Tree Sitter 
to see the whole file or just the narrowed parts, and see what works in 
practice works better.

Perry


[-- Attachment #2: Type: text/html, Size: 3556 bytes --]

^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-26 17:09                                                             ` Yuan Fu
@ 2021-07-26 18:55                                                               ` Eli Zaretskii
  2021-07-26 19:06                                                                 ` Yuan Fu
  2021-07-27  6:13                                                               ` Stephen Leake
  1 sibling, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-26 18:55 UTC (permalink / raw)
  To: Yuan Fu; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Mon, 26 Jul 2021 13:09:13 -0400
> Cc: Stephen Leake <stephen_leake@stephe-leake.org>,
>  cpitclaudel@gmail.com,
>  monnier@iro.umontreal.ca,
>  emacs-devel@gnu.org
> 
> > Which complexity does it add?  You just compare with BEGV_BYTE instead
> > of BEG_BYTE etc.
> 
> We need to “delete” the hidden text and “re-insert” when we widen the buffer. I’ll try to make it a no-op as long as we remember to widen before calling tree-sitter to parse anything.

If some parser needs access to the whole buffer, its caller should
widen the buffer before calling the parser.

IOW, the control on which part of the buffer is visible to the parser
should be on the level of the caller of the parser, not at the level
of the function which accesses buffer text.

> > If we let TS look where it wants, we will lose the ability to restrict
> > it to a certain part of the buffer text.  This is needed at least for
> > some specialized modes, and is generally desirable, as it gives Lisp
> > programs an easy way to impose such restrictions whenever they need.
> 
> Tree-sitter lets you set ranges for a parser to limit it self within, in order to support multi-language files.

That's okay, but why would we want to expose this to Lisp as the means
to restrict the accessible portion, when we already have such a means?



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-26 18:55                                                               ` Eli Zaretskii
@ 2021-07-26 19:06                                                                 ` Yuan Fu
  2021-07-26 19:19                                                                   ` Perry E. Metzger
  2021-07-26 19:20                                                                   ` Eli Zaretskii
  0 siblings, 2 replies; 284+ messages in thread
From: Yuan Fu @ 2021-07-26 19:06 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: cpitclaudel, Stephen Leake, monnier, emacs-devel

> 
>>> If we let TS look where it wants, we will lose the ability to restrict
>>> it to a certain part of the buffer text.  This is needed at least for
>>> some specialized modes, and is generally desirable, as it gives Lisp
>>> programs an easy way to impose such restrictions whenever they need.
>> 
>> Tree-sitter lets you set ranges for a parser to limit it self within, in order to support multi-language files.
> 
> That's okay, but why would we want to expose this to Lisp as the means
> to restrict the accessible portion, when we already have such a means?

Tree-sitter lets you set multiple discontinuous ranges, whereas narrowing can only narrow to a single continuous range. Multiple discontinuous range is much more useful for HTML+CSS+JS, or PHP + HML.

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-26 18:32                                                             ` chad
  2021-07-26 18:44                                                               ` Perry E. Metzger
@ 2021-07-26 19:09                                                               ` Eli Zaretskii
  2021-07-26 19:48                                                                 ` chad
  1 sibling, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-26 19:09 UTC (permalink / raw)
  To: chad; +Cc: casouri, emacs-devel, stephen_leake, cpitclaudel, monnier

> From: chad <yandros@gmail.com>
> Date: Mon, 26 Jul 2021 11:32:23 -0700
> Cc: Yuan Fu <casouri@gmail.com>, Clément Pit-Claudel <cpitclaudel@gmail.com>, 
> 	Stephen Leake <stephen_leake@stephe-leake.org>, Stefan Monnier <monnier@iro.umontreal.ca>, 
> 	EMACS development team <emacs-devel@gnu.org>
> 
> In order to exercise many of its over-all benefits, tree-sitter builds and maintains a parse tree of the whole
> file, and takes care to modify that tree in-place with minimized changes. If emacs+ts were to remove
> everything outside the narrow and then re-add it each time something temporarily narrowed a file (say, to
> enhance mental focus), then emacs+ts would be (wastefully) throwing away some of the underlying
> assumptions that makes ts useful.
> 
> Emacs' internals use narrow/widen *and mostly honor them at most levels* because they are emacs'
> abstraction for separating parts of a buffer from other parts. Tree-sitter has a separate abstraction for doing
> this -- the developer can have ts use different internal objects for different parts of the file. This allows editors
> like Atom (a major influence on ts' original feature set) to support what emacs would call multiple major
> modes. (Emacs could still use some help here, c.f. Alan's proposal for "islands" from a few years back.) 

We are mis-communicating.  The issue is not _whether_ to allow TS to
access most or all of the buffer text, the issue is _on_what_level_
should this be controlled.  All I'm saying is that the right level is
NOT the function which accesses buffer text, the right level is
higher.  At that higher level, if some parser (or even almost every
parser) needs to access the entire buffer, some code should call
'widen'.

By contrast, if the text-reading function always treats the buffer as
widened, we will never be able to invoke a TS parser on a portion of
the text, something that is needed by specialized features.  It makes
no sense to require those features to start using TS-specific means of
restricting access to portions of the buffer to that effect, when a
simple restriction is good enough and is already being used.

> Using the ts multiple parsers support inside emacs+ts to "handle" narrowing seems like a strong idea, but
> there are likely some complexities involved in "switching" back and forth between the full-file parse and the
> narrowed parse, plus making sure that the right parses are updated when the buffer changes. With that in
> mind, it might be easier to start with an emacs+ts prototype that always uses the full-file parse, and then
> adding the "sub-parses" later.

I disagree.  The cost of having the text-reading function look only
inside the restriction is very small: a single call to 'widen' in the
caller.  The cost of having that function ignore the restriction is
much higher.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-26 18:44                                                               ` Perry E. Metzger
@ 2021-07-26 19:13                                                                 ` Eli Zaretskii
  0 siblings, 0 replies; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-26 19:13 UTC (permalink / raw)
  To: Perry E. Metzger; +Cc: emacs-devel

> Date: Mon, 26 Jul 2021 14:44:24 -0400
> From: "Perry E. Metzger" <perry@piermont.com>
> 
> My suggestion is that we at least experiment with allowing Tree Sitter to see the whole file or just the
> narrowed parts, and see what works in practice works better.

No one suggested anything to the contrary.  The callers of TS which
need it to access the entire buffer should call 'widen', and that's
it.

This is a tempest in a teapot.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-26 19:06                                                                 ` Yuan Fu
@ 2021-07-26 19:19                                                                   ` Perry E. Metzger
  2021-07-26 19:31                                                                     ` Eli Zaretskii
  2021-07-26 19:20                                                                   ` Eli Zaretskii
  1 sibling, 1 reply; 284+ messages in thread
From: Perry E. Metzger @ 2021-07-26 19:19 UTC (permalink / raw)
  To: emacs-devel


On 7/26/21 15:06, Yuan Fu wrote:
> Tree-sitter lets you set multiple discontinuous ranges, whereas 
> narrowing can only narrow to a single continuous range. Multiple 
> discontinuous range is much more useful for HTML+CSS+JS, or PHP + HML. 


Other obvious uses: restructured text or markdown documentation amidst 
code in another language, various sorts of literate programming, etc.

(This of course brings up that someday it might be nice to have Emacs 
aware of such multi-modal text and able to switch how you're editing 
even inside a single file, but that's a bigger topic.)


Perry




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-26 19:06                                                                 ` Yuan Fu
  2021-07-26 19:19                                                                   ` Perry E. Metzger
@ 2021-07-26 19:20                                                                   ` Eli Zaretskii
  2021-07-26 19:45                                                                     ` Yuan Fu
  1 sibling, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-26 19:20 UTC (permalink / raw)
  To: Yuan Fu; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Mon, 26 Jul 2021 15:06:14 -0400
> Cc: Stephen Leake <stephen_leake@stephe-leake.org>,
>  cpitclaudel@gmail.com,
>  monnier@iro.umontreal.ca,
>  emacs-devel@gnu.org
> 
> > 
> >>> If we let TS look where it wants, we will lose the ability to restrict
> >>> it to a certain part of the buffer text.  This is needed at least for
> >>> some specialized modes, and is generally desirable, as it gives Lisp
> >>> programs an easy way to impose such restrictions whenever they need.
> >> 
> >> Tree-sitter lets you set ranges for a parser to limit it self within, in order to support multi-language files.
> > 
> > That's okay, but why would we want to expose this to Lisp as the means
> > to restrict the accessible portion, when we already have such a means?
> 
> Tree-sitter lets you set multiple discontinuous ranges, whereas narrowing can only narrow to a single continuous range. Multiple discontinuous range is much more useful for HTML+CSS+JS, or PHP + HML.

I understand.  But forcing various Emacs features to use these ranges
where a simple restriction will do makes little sense.

Last time something like these discontinuous ranges was discussed as a
general feature in Emacs, we couldn't come up with an agreed-upon
design and implementation.  So adding something like that to Emacs is
not an easy job.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-26 19:19                                                                   ` Perry E. Metzger
@ 2021-07-26 19:31                                                                     ` Eli Zaretskii
  0 siblings, 0 replies; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-26 19:31 UTC (permalink / raw)
  To: Perry E. Metzger; +Cc: emacs-devel

> Date: Mon, 26 Jul 2021 15:19:20 -0400
> From: "Perry E. Metzger" <perry@piermont.com>
> 
> Other obvious uses: restructured text or markdown documentation amidst 
> code in another language, various sorts of literate programming, etc.

We should, of course, support these features.  But their support
should be controlled by Lisp programs, not be hard-coded in some
low-level C code.  The way to access discontinuous ranges of buffer
text as a single character sequence needs support in Emacs Lisp before
we can map it to the equivalent TS features.

> (This of course brings up that someday it might be nice to have Emacs 
> aware of such multi-modal text and able to switch how you're editing 
> even inside a single file, but that's a bigger topic.)

We have the beginning of this, but have a lot more turf to cover.  And
currently, what we have uses restrictions.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-26 19:20                                                                   ` Eli Zaretskii
@ 2021-07-26 19:45                                                                     ` Yuan Fu
  2021-07-26 19:57                                                                       ` Dmitry Gutov
  0 siblings, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-07-26 19:45 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: cpitclaudel, Stephen Leake, monnier, emacs-devel


>>>>> If we let TS look where it wants, we will lose the ability to restrict
>>>>> it to a certain part of the buffer text.  This is needed at least for
>>>>> some specialized modes, and is generally desirable, as it gives Lisp
>>>>> programs an easy way to impose such restrictions whenever they need.
>>>> 
>>>> Tree-sitter lets you set ranges for a parser to limit it self within, in order to support multi-language files.
>>> 
>>> That's okay, but why would we want to expose this to Lisp as the means
>>> to restrict the accessible portion, when we already have such a means?
>> 
>> Tree-sitter lets you set multiple discontinuous ranges, whereas narrowing can only narrow to a single continuous range. Multiple discontinuous range is much more useful for HTML+CSS+JS, or PHP + HML.
> 
> I understand.  But forcing various Emacs features to use these ranges
> where a simple restriction will do makes little sense.
> 
> Last time something like these discontinuous ranges was discussed as a
> general feature in Emacs, we couldn't come up with an agreed-upon
> design and implementation.  So adding something like that to Emacs is
> not an easy job.

We can provide both. Those who needs the more powerful ranges could use that, and those who don’t can use narrowing.

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-26 19:09                                                               ` Eli Zaretskii
@ 2021-07-26 19:48                                                                 ` chad
  2021-07-26 20:05                                                                   ` Óscar Fuentes
  2021-07-27 13:59                                                                   ` Eli Zaretskii
  0 siblings, 2 replies; 284+ messages in thread
From: chad @ 2021-07-26 19:48 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: Yuan Fu, EMACS development team, Stephen Leake,
	Clément Pit-Claudel, Stefan Monnier

[-- Attachment #1: Type: text/plain, Size: 998 bytes --]

I think I understand your point, and I agree that it would be ill-advised
to remove the ability to change the "scope" in question from lisp's
control. What I'm trying to say (and I think Yuan Fu is also suggesting) is
that while emacs necessarily has *one* view of the buffer, narrowed or not,
tree-sitter might want to maintain multiple trees of that buffer, with the
default being the same as emacs' widened view, and narrowed views being
separate parse trees created as needed. I'm suggesting this as an
alternative to having emacs+ts effectively throw away most of the parse
tree every time the user narrows, then have to re-build it on each widen.

In other words, I think this might be a communication issue about the
default trade-off behavior: generally keep a fully-widened tree and
create/use narrowed trees as needed, versus generally keeping only a tree
that matches whatever a higher-level view of the buffer might give at any
one moment, rebuilding each time.

Hope that helps,
~Chad

[-- Attachment #2: Type: text/html, Size: 1104 bytes --]

^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-26 19:45                                                                     ` Yuan Fu
@ 2021-07-26 19:57                                                                       ` Dmitry Gutov
  0 siblings, 0 replies; 284+ messages in thread
From: Dmitry Gutov @ 2021-07-26 19:57 UTC (permalink / raw)
  To: Yuan Fu, Eli Zaretskii; +Cc: cpitclaudel, Stephen Leake, monnier, emacs-devel

On 26.07.2021 22:45, Yuan Fu wrote:
>> Last time something like these discontinuous ranges was discussed as a
>> general feature in Emacs, we couldn't come up with an agreed-upon
>> design and implementation.  So adding something like that to Emacs is
>> not an easy job.
> We can provide both. Those who needs the more powerful ranges could use that, and those who don’t can use narrowing.

If one wanted to continue where the previous discussions stopped, we 
tentatively decided that the variable prog-indentation-context could 
help. I.e. when some multiple-major-mode framework wanted to tell the 
current major mode that there are more "ranges" of the same mode in the 
buffer, it would bind prog-indentation-context to some particular value.

It's very much "to be discussed later", but the second element of 
prog-indentation-context can be a list of those ranges, or, more likely, 
a functions that produces that list.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-26 19:48                                                                 ` chad
@ 2021-07-26 20:05                                                                   ` Óscar Fuentes
  2021-07-26 21:30                                                                     ` Clément Pit-Claudel
  2021-07-27 14:02                                                                     ` Eli Zaretskii
  2021-07-27 13:59                                                                   ` Eli Zaretskii
  1 sibling, 2 replies; 284+ messages in thread
From: Óscar Fuentes @ 2021-07-26 20:05 UTC (permalink / raw)
  To: emacs-devel

chad <yandros@gmail.com> writes:

> I think I understand your point, and I agree that it would be ill-advised
> to remove the ability to change the "scope" in question from lisp's
> control. What I'm trying to say (and I think Yuan Fu is also suggesting) is
> that while emacs necessarily has *one* view of the buffer, narrowed or not,
> tree-sitter might want to maintain multiple trees of that buffer, with the
> default being the same as emacs' widened view, and narrowed views being
> separate parse trees created as needed. I'm suggesting this as an
> alternative to having emacs+ts effectively throw away most of the parse
> tree every time the user narrows, then have to re-build it on each widen.
>
> In other words, I think this might be a communication issue about the
> default trade-off behavior: generally keep a fully-widened tree and
> create/use narrowed trees as needed, versus generally keeping only a tree
> that matches whatever a higher-level view of the buffer might give at any
> one moment, rebuilding each time.

IIUC this is not about the user doing M-x narrow-to-defun or somesuch,
we can agree that (in general) the right thing to do for TS is to keep
using the whole buffer.

I think Eli is trying to control what TS sees because doing so would
make possible some features and/or simplify implementing them. Think of
an Org file with some code blocks. It makes no sense to expose the whole
Org file to TS and, I guess, it would just complicate things for no
benefit. On that scenario, it might make sense to deal with the code
blocks as independent entities instead of parts of something else.




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-26 20:05                                                                   ` Óscar Fuentes
@ 2021-07-26 21:30                                                                     ` Clément Pit-Claudel
  2021-07-26 21:46                                                                       ` Óscar Fuentes
  2021-07-27 14:02                                                                     ` Eli Zaretskii
  1 sibling, 1 reply; 284+ messages in thread
From: Clément Pit-Claudel @ 2021-07-26 21:30 UTC (permalink / raw)
  To: emacs-devel

On 7/26/21 4:05 PM, Óscar Fuentes wrote:
> Think of an Org file with some code blocks. It makes no sense to
> expose the whole Org file to TS and, I guess, it would just
> complicate things for no benefit. On that scenario, it might make
> sense to deal with the code blocks as independent entities instead of
> parts of something else.

Isn't this example actually in favor of *not* narrowing before giving the buffer to TS? Consecutive org code blocks often build upon each other, so you'd want to give the whole buffer to TS and restrict its analysis to just the code blocks (multiple disjoint ranges), a results that you couldn't achieve with narrowing.





^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-26 21:30                                                                     ` Clément Pit-Claudel
@ 2021-07-26 21:46                                                                       ` Óscar Fuentes
  0 siblings, 0 replies; 284+ messages in thread
From: Óscar Fuentes @ 2021-07-26 21:46 UTC (permalink / raw)
  To: emacs-devel

Clément Pit-Claudel <cpitclaudel@gmail.com> writes:

> On 7/26/21 4:05 PM, Óscar Fuentes wrote:
>> Think of an Org file with some code blocks. It makes no sense to
>> expose the whole Org file to TS and, I guess, it would just
>> complicate things for no benefit. On that scenario, it might make
>> sense to deal with the code blocks as independent entities instead of
>> parts of something else.
>
> Isn't this example actually in favor of *not* narrowing before giving
> the buffer to TS? Consecutive org code blocks often build upon each
> other, so you'd want to give the whole buffer to TS and restrict its
> analysis to just the code blocks (multiple disjoint ranges), a results
> that you couldn't achieve with narrowing.

I don't know from where you got your "often" :-) Of course it is a
possibility, but I've seen plenty of Org files containing loosely
related code snippets (and those who build on each other tend to be
written on dynamic languages, which benefit a lot less from static code
analysis.)

As far as my personal experience goes, I very much prefer that each
block is treated indepently, because my Org files contain code recipes
for specialized tasks, bug reproducers, multiple variations of
experiments, etc. They are not related at all.

So it is desirable to support both modes well.




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-26 16:40                                                         ` Yuan Fu
  2021-07-26 16:49                                                           ` Eli Zaretskii
@ 2021-07-26 23:40                                                           ` Ergus
  2021-07-27 14:49                                                             ` Yuan Fu
  1 sibling, 1 reply; 284+ messages in thread
From: Ergus @ 2021-07-26 23:40 UTC (permalink / raw)
  To: Yuan Fu
  Cc: Eli Zaretskii, Clément Pit-Claudel, Stephen Leake,
	Stefan Monnier, emacs-devel

On Mon, Jul 26, 2021 at 12:40:31PM -0400, Yuan Fu wrote:
>>
>>>>> unless the narrowing is for multi-major-mode.
>>>>
>>>> And what would you do in that case, if you allow TS to look beyond the
>>>> restriction?
>>>
>>> In the multi-major-mode case, there is a separate parser for each
>>> language, and each sub-mode region in the text would get its own parser
>>> tree (ie, it acts like a separate file), and that parser tree is only
>>> told about changes to those regions. So the parser will never try to
>>> look outside the region; it doesn't need to know about narrowing.
>>
>> Once again, we are talking about the function used by TS to read
>> buffer text.  Not about the parser or its caller.  Low-level code,
>> which knows nothing about the context, should never look beyond the
>> restriction.
>
>It doesn’t harm for tree-sitter to see the rest of the buffer, it
>doesn’t modify anything, all it does it reading the text. OTOH,
>restricting tree-sitter to the bounds of narrows adds complexity for no
>benefit (as far as I can see). Maybe narrowing is the context that low
>level code should ignore, or at least tree-sitter should ignore. The
>only benefit that I can think of is “we firmly adhere to the ‘contract’
>that no one can look beyond the narrowed region”, but is it a good
>contract? Is there really a contract in the first place? IMO, narrowing
>acts like masking tapes over the rest of the buffer, so that user edits
>like re-replace wouldn’t spill out. Demanding everything in Emacs to
>not have access to the rest of the buffer is dogmatic (in the sense
>that it is too rigid and is simply following the doctrine blindly).
>
Hi Yuan:

 From my absolute ignorance on tree_sitter and your changes. There is a
function ts_parser_set_included_ranges that is a way I used once to
reduce the parsing region and improve (notably) the performance in a
test api.

Can't narrow regions use that? I think it is the same idea but I am
probably wrong.

Limiting the region to parse to the modified region (that in emacs may
be known thanks to the gap and maybe the undo-tree) and using the output
tree from the previous parse as the `old_tree` parameter in
ts_parser_parse_string made tree_sitter incredibly fast in my case (and
useful to run it on every key press).

In my case using old_tree reduced the time by a factor of 10 in a big
source file; and limiting the parser to the "changed" region only made
it almost instantly in more than 80% of the executions with small
modifications. (I repeat; it was a much simpler use case)

>And about language definitions and font-locking, I just realized that
>tree-sitter language definitions provides highlighting patterns, and we
>only need to minimally modify them to use them for Emacs, so there
>aren’t much manual effort involved.
>
I think tree-sitter has many more language definitions than Emacs in
some languages, and probably we may want to properly support them. So
maybe: instead of just modifying what is on tree-sitter to make it
similar to what emacs currently has; we could just use the node's
syntactic information and then let emacs use it adding more faces if
needed... Does it makes sense?

The idea is to have real syntactic information on the text itself
because that may help in the future to implement indentation and
navigation commands using three-sitter's information (commands like
up-list or forward-sexp) will be the equivalent to
ts_tree_cursor_goto_parent or ts_tree_cursor_goto_next_sibling.

>Also, anyone have thoughts on how should tree-sitter intergrate with
>font-lock beyond the current simple interface?
>
No idea, but in my experience the most efficient way to traverse a
tree-sitter tree is with ts_tree_cursor but maybe for font-lock the best
is just to use ts_tree_get_changed_ranges.

>Yuan

Best,
Ergus



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-26 17:09                                                             ` Yuan Fu
  2021-07-26 18:55                                                               ` Eli Zaretskii
@ 2021-07-27  6:13                                                               ` Stephen Leake
  2021-07-27 14:56                                                                 ` Yuan Fu
  1 sibling, 1 reply; 284+ messages in thread
From: Stephen Leake @ 2021-07-27  6:13 UTC (permalink / raw)
  To: Yuan Fu; +Cc: Eli Zaretskii, emacs-devel, cpitclaudel, monnier

Yuan Fu <casouri@gmail.com> writes:

>>> 
>>>> Once again, we are talking about the function used by TS to read
>>>> buffer text.  Not about the parser or its caller.  Low-level code,
>>>> which knows nothing about the context, should never look beyond the
>>>> restriction.
>>> 
>>> It doesn’t harm for tree-sitter to see the rest of the buffer, it
>>> doesn’t modify anything, all it does it reading the text. OTOH,
>>> restricting tree-sitter to the bounds of narrows adds complexity
>>> for no benefit (as far as I can see).
>> 
>> Which complexity does it add?  You just compare with BEGV_BYTE instead
>> of BEG_BYTE etc.
>
> We need to “delete” the hidden text and “re-insert” when we widen the
> buffer. I’ll try to make it a no-op as long as we remember to widen
> before calling tree-sitter to parse anything.

First, the only thing TS deletes is tree nodes, not text; it does not
have a copy of the buffer.

Why do you think we need to delete the tree nodes corresponding to the
hidden text? They provide exactly the context needed to parse the
visible text properly.

This assumes the narrowing is temporary, not for a multi-major-mode.

-- 
-- Stephe



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-26 19:48                                                                 ` chad
  2021-07-26 20:05                                                                   ` Óscar Fuentes
@ 2021-07-27 13:59                                                                   ` Eli Zaretskii
  1 sibling, 0 replies; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-27 13:59 UTC (permalink / raw)
  To: chad; +Cc: casouri, emacs-devel, stephen_leake, cpitclaudel, monnier

> From: chad <yandros@gmail.com>
> Date: Mon, 26 Jul 2021 12:48:00 -0700
> Cc: Yuan Fu <casouri@gmail.com>, Clément Pit-Claudel <cpitclaudel@gmail.com>, 
> 	Stephen Leake <stephen_leake@stephe-leake.org>, Stefan Monnier <monnier@iro.umontreal.ca>, 
> 	EMACS development team <emacs-devel@gnu.org>
> 
> I think I understand your point, and I agree that it would be ill-advised to remove the ability to change the
> "scope" in question from lisp's control. What I'm trying to say (and I think Yuan Fu is also suggesting) is that
> while emacs necessarily has *one* view of the buffer, narrowed or not, tree-sitter might want to maintain
> multiple trees of that buffer, with the default being the same as emacs' widened view, and narrowed views
> being separate parse trees created as needed.

Lisp programs which use TS in a way that causes TS to store such
multiple views will have to widen the buffer at strategic places (when
TS needs access to buffer text).

> I'm suggesting this as an alternative to having emacs+ts
> effectively throw away most of the parse tree every time the user narrows, then have to re-build it on each
> widen.

No one says that when Emacs narrows a buffer for some reason, we need
to communicate that immediately to TS.  If the restriction is
ephemeral, it will most probably be lifted by the time we need to
update TS with the editing changes.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-26 20:05                                                                   ` Óscar Fuentes
  2021-07-26 21:30                                                                     ` Clément Pit-Claudel
@ 2021-07-27 14:02                                                                     ` Eli Zaretskii
  1 sibling, 0 replies; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-27 14:02 UTC (permalink / raw)
  To: Óscar Fuentes; +Cc: emacs-devel

> From: Óscar Fuentes <ofv@wanadoo.es>
> Date: Mon, 26 Jul 2021 22:05:58 +0200
> 
> I think Eli is trying to control what TS sees because doing so would
> make possible some features and/or simplify implementing them.

Yes, exactly.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-26 23:40                                                           ` Ergus
@ 2021-07-27 14:49                                                             ` Yuan Fu
  2021-07-27 16:50                                                               ` Ergus
  0 siblings, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-07-27 14:49 UTC (permalink / raw)
  To: Ergus
  Cc: Eli Zaretskii, emacs-devel, Stephen Leake,
	Clément Pit-Claudel, Stefan Monnier

> 
> From my absolute ignorance on tree_sitter and your changes. There is a
> function ts_parser_set_included_ranges that is a way I used once to
> reduce the parsing region and improve (notably) the performance in a
> test api.
> 
> Can't narrow regions use that? I think it is the same idea but I am
> probably wrong.

We could use ts_parser_set_included_ranges to implement narrowing, but that would limit the usefulness of ts_parser_set_included_ranges: ts_parser_set_included_ranges allows us to set multiple discontinuous ranges, and narrowing only allows us to narrow to a single continuous range. Therefore I’d like to expose ts_parser_set_included_ranges in a separate function.

> 
> Limiting the region to parse to the modified region (that in emacs may
> be known thanks to the gap and maybe the undo-tree) and using the output
> tree from the previous parse as the `old_tree` parameter in
> ts_parser_parse_string made tree_sitter incredibly fast in my case (and
> useful to run it on every key press).

Interesting, the official documentation doesn’t mention that trick. It only tells me to re-parse with the old tree. If I limit the range to the modified region before re-parse, re-parse, do I get the tree for the entire buffer, or do I only get the tree of the limited range?

> 
> In my case using old_tree reduced the time by a factor of 10 in a big
> source file; and limiting the parser to the "changed" region only made
> it almost instantly in more than 80% of the executions with small
> modifications. (I repeat; it was a much simpler use case)
> 
>> And about language definitions and font-locking, I just realized that
>> tree-sitter language definitions provides highlighting patterns, and we
>> only need to minimally modify them to use them for Emacs, so there
>> aren’t much manual effort involved.
>> 
> I think tree-sitter has many more language definitions than Emacs in
> some languages, and probably we may want to properly support them. So
> maybe: instead of just modifying what is on tree-sitter to make it
> similar to what emacs currently has; we could just use the node's
> syntactic information and then let emacs use it adding more faces if
> needed... Does it makes sense?

The current code does the latter, if I understand you correctly.

> The idea is to have real syntactic information on the text itself
> because that may help in the future to implement indentation and
> navigation commands using three-sitter's information (commands like
> up-list or forward-sexp) will be the equivalent to
> ts_tree_cursor_goto_parent or ts_tree_cursor_goto_next_sibling.

You mean adding syntactic information to the text as text properties? That’s an interesting idea, maybe that’s easier to use than using tree-sitter’s api.

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-27  6:13                                                               ` Stephen Leake
@ 2021-07-27 14:56                                                                 ` Yuan Fu
  2021-07-28  3:40                                                                   ` Stephen Leake
  0 siblings, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-07-27 14:56 UTC (permalink / raw)
  To: Stephen Leake
  Cc: Eli Zaretskii, Clément Pit-Claudel, monnier, emacs-devel

>> 
>> We need to “delete” the hidden text and “re-insert” when we widen the
>> buffer. I’ll try to make it a no-op as long as we remember to widen
>> before calling tree-sitter to parse anything.
> 
> First, the only thing TS deletes is tree nodes, not text; it does not
> have a copy of the buffer.
> 
> Why do you think we need to delete the tree nodes corresponding to the
> hidden text? They provide exactly the context needed to parse the
> visible text properly.

I don’t think we need to, but I assume that tree-sitter will delete the corresponding nodes if we hide the text from it. For us, the text is there, just hidden; for tree-sitter, the text is deleted. 

Yuan




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-27 14:49                                                             ` Yuan Fu
@ 2021-07-27 16:50                                                               ` Ergus
  2021-07-27 16:59                                                                 ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Ergus @ 2021-07-27 16:50 UTC (permalink / raw)
  To: Yuan Fu
  Cc: Eli Zaretskii, Clément Pit-Claudel, Stephen Leake,
	Stefan Monnier, emacs-devel

On Tue, Jul 27, 2021 at 10:49:44AM -0400, Yuan Fu wrote:

>
>We could use ts_parser_set_included_ranges to implement narrowing, but
>that would limit the usefulness of ts_parser_set_included_ranges:
>ts_parser_set_included_ranges allows us to set multiple discontinuous
>ranges, and narrowing only allows us to narrow to a single continuous
>range.

I agree here.

>Therefore I’d like to expose ts_parser_set_included_ranges in a
>separate function.
>
Exposed in the lisp side could make sense then. 


>
>Interesting, the official documentation doesn’t mention that trick. It
>only tells me to re-parse with the old tree. If I limit the range to
>the modified region before re-parse, re-parse, do I get the tree for
>the entire buffer, or do I only get the tree of the limited range?
>
It worked for me; but it was a much simpler use case; maybe in the
general case it breaks. I think the only way to know is to try it.

Any way the official documentation suggests to use
ts_tree_get_changed_ranges.

>
>The current code does the latter, if I understand you correctly.
>
>
>You mean adding syntactic information to the text as text properties?
>That’s an interesting idea, maybe that’s easier to use than using
>tree-sitter’s api.
>
I think that was the initial Eli's idea when this topic came out. But
maybe I understood it wrongly.

Theoretically in a re-parse doing ts_tree_get_changed_ranges will give
the list of changes needed in the whole text, so updating properties
there may be simpler and cheap (even when they are not in the visible
part of the buffer).

Also, any action that doesn't modify the text (scrolling, moving the
cursor, windows split/resize) won't call any tree-sitter and redisplay
could handle almost everything easily on the beginning.

The only concern here may be that adding properties to the entire text
may be memory consuming. Or maybe that this could overlap part of the
font-lock functionality.

But probably Eli can make a more accurate critic of this idea..

>Yuan

Very thanks for doing this!
Ergus.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-27 16:50                                                               ` Ergus
@ 2021-07-27 16:59                                                                 ` Eli Zaretskii
  2021-07-28  3:45                                                                   ` Stephen Leake
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-27 16:59 UTC (permalink / raw)
  To: Ergus; +Cc: casouri, emacs-devel, stephen_leake, cpitclaudel, monnier

> Date: Tue, 27 Jul 2021 18:50:40 +0200
> From: Ergus <spacibba@aol.com>
> Cc: Eli Zaretskii <eliz@gnu.org>,
> 	Clément Pit-Claudel <cpitclaudel@gmail.com>,
> 	Stephen Leake <stephen_leake@stephe-leake.org>,
> 	Stefan Monnier <monnier@iro.umontreal.ca>, emacs-devel@gnu.org
> 
> On Tue, Jul 27, 2021 at 10:49:44AM -0400, Yuan Fu wrote:
> 
> >You mean adding syntactic information to the text as text properties?
> >That’s an interesting idea, maybe that’s easier to use than using
> >tree-sitter’s api.
> >
> I think that was the initial Eli's idea when this topic came out. But
> maybe I understood it wrongly.
> 
> Theoretically in a re-parse doing ts_tree_get_changed_ranges will give
> the list of changes needed in the whole text, so updating properties
> there may be simpler and cheap (even when they are not in the visible
> part of the buffer).
> 
> Also, any action that doesn't modify the text (scrolling, moving the
> cursor, windows split/resize) won't call any tree-sitter and redisplay
> could handle almost everything easily on the beginning.
> 
> The only concern here may be that adding properties to the entire text
> may be memory consuming. Or maybe that this could overlap part of the
> font-lock functionality.
> 
> But probably Eli can make a more accurate critic of this idea..

Storing the syntactic information as text properties has definite
advantages: easy access, use of well-known Emacs Lisp features, etc.
I don't feel I know enough about this use of the properties to have a
definitive opinion, though.  We should probably try that.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-27 14:56                                                                 ` Yuan Fu
@ 2021-07-28  3:40                                                                   ` Stephen Leake
  2021-07-28 16:36                                                                     ` Yuan Fu
  0 siblings, 1 reply; 284+ messages in thread
From: Stephen Leake @ 2021-07-28  3:40 UTC (permalink / raw)
  To: Yuan Fu; +Cc: Eli Zaretskii, Clément Pit-Claudel, monnier, emacs-devel

Yuan Fu <casouri@gmail.com> writes:

>>> 
>>> We need to “delete” the hidden text and “re-insert” when we widen the
>>> buffer. I’ll try to make it a no-op as long as we remember to widen
>>> before calling tree-sitter to parse anything.
>> 
>> First, the only thing TS deletes is tree nodes, not text; it does not
>> have a copy of the buffer.
>> 
>> Why do you think we need to delete the tree nodes corresponding to the
>> hidden text? They provide exactly the context needed to parse the
>> visible text properly.
>
> I don’t think we need to, but I assume that tree-sitter will delete
> the corresponding nodes if we hide the text from it. 

No, tree-sitter only deletes nodes that cover changes.

So don't send a change that deletes the hidden text; just send changes
in the visible part of the text (that's the only place the user can make
changes). tree-sitter will only run the scanner on the change regions,
so it will only request text from the visible part of the buffer;
all the requests will succeed.

> For us, the text is there, just hidden; for tree-sitter, the text is
> deleted.

No, it simply won't notice that it can't access that part of the buffer,
because it will never try.

What, exactly, will the buffer-text fetch code do if tree-sitter
violates the narrowing (by some error in tree-sitter or user code)?
throw an exception? return a null string?

-- 
-- Stephe



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-27 16:59                                                                 ` Eli Zaretskii
@ 2021-07-28  3:45                                                                   ` Stephen Leake
  0 siblings, 0 replies; 284+ messages in thread
From: Stephen Leake @ 2021-07-28  3:45 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Ergus, emacs-devel, cpitclaudel, casouri, monnier

Eli Zaretskii <eliz@gnu.org> writes:

> Storing the syntactic information as text properties has definite
> advantages: easy access, use of well-known Emacs Lisp features, etc.
> I don't feel I know enough about this use of the properties to have a
> definitive opinion, though.  We should probably try that.

ada-mode does this now, via wisi. It marks each name that might be used
for cross-reference with a "name" property; the start and end of each
procedure and statement with "start/end" properties, so it is easy to
jump there; other similar things.

-- 
-- Stephe



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-28  3:40                                                                   ` Stephen Leake
@ 2021-07-28 16:36                                                                     ` Yuan Fu
  2021-07-28 16:41                                                                       ` Eli Zaretskii
  2021-07-28 16:43                                                                       ` Eli Zaretskii
  0 siblings, 2 replies; 284+ messages in thread
From: Yuan Fu @ 2021-07-28 16:36 UTC (permalink / raw)
  To: Stephen Leake
  Cc: Eli Zaretskii, Clément Pit-Claudel, monnier, emacs-devel



> On Jul 27, 2021, at 11:40 PM, Stephen Leake <stephen_leake@stephe-leake.org> wrote:
> 
> Yuan Fu <casouri@gmail.com> writes:
> 
>>>> 
>>>> We need to “delete” the hidden text and “re-insert” when we widen the
>>>> buffer. I’ll try to make it a no-op as long as we remember to widen
>>>> before calling tree-sitter to parse anything.
>>> 
>>> First, the only thing TS deletes is tree nodes, not text; it does not
>>> have a copy of the buffer.
>>> 
>>> Why do you think we need to delete the tree nodes corresponding to the
>>> hidden text? They provide exactly the context needed to parse the
>>> visible text properly.
>> 
>> I don’t think we need to, but I assume that tree-sitter will delete
>> the corresponding nodes if we hide the text from it. 
> 
> No, tree-sitter only deletes nodes that cover changes.
> 
> So don't send a change that deletes the hidden text; just send changes
> in the visible part of the text (that's the only place the user can make
> changes). tree-sitter will only run the scanner on the change regions,
> so it will only request text from the visible part of the buffer;
> all the requests will succeed.

Then we are not hiding the hidden text from tree-sitter. The implementation you described, IIUC, is essentially do nothing special when the buffer is narrowed. 

> 
>> For us, the text is there, just hidden; for tree-sitter, the text is
>> deleted.
> 
> No, it simply won't notice that it can't access that part of the buffer,
> because it will never try.
> 
> What, exactly, will the buffer-text fetch code do if tree-sitter
> violates the narrowing (by some error in tree-sitter or user code)?
> throw an exception? return a null string?

In my current implementation, if tree-sitter access buffer content outside of narrowed region, it reads whitespaces, if it access buffer content outside of the buffer, it reads null string. Neither case throws an error.

Yuan





^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-28 16:36                                                                     ` Yuan Fu
@ 2021-07-28 16:41                                                                       ` Eli Zaretskii
  2021-07-29 22:58                                                                         ` Stephen Leake
  2021-07-28 16:43                                                                       ` Eli Zaretskii
  1 sibling, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-28 16:41 UTC (permalink / raw)
  To: Yuan Fu; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Wed, 28 Jul 2021 12:36:33 -0400
> Cc: Eli Zaretskii <eliz@gnu.org>,
>  emacs-devel <emacs-devel@gnu.org>,
>  Clément Pit-Claudel <cpitclaudel@gmail.com>,
>  monnier@iro.umontreal.ca
> 
> > So don't send a change that deletes the hidden text; just send changes
> > in the visible part of the text (that's the only place the user can make
> > changes). tree-sitter will only run the scanner on the change regions,
> > so it will only request text from the visible part of the buffer;
> > all the requests will succeed.
> 
> Then we are not hiding the hidden text from tree-sitter. The implementation you described, IIUC, is essentially do nothing special when the buffer is narrowed. 

If the TS parser is called while the narrowing is in effect, it will
be unable to access text beyond BEGV..ZV.  So in that case the
narrowing _will_ affect TS.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-28 16:36                                                                     ` Yuan Fu
  2021-07-28 16:41                                                                       ` Eli Zaretskii
@ 2021-07-28 16:43                                                                       ` Eli Zaretskii
  2021-07-28 17:47                                                                         ` Yuan Fu
  1 sibling, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-28 16:43 UTC (permalink / raw)
  To: Yuan Fu; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Wed, 28 Jul 2021 12:36:33 -0400
> Cc: Eli Zaretskii <eliz@gnu.org>,
>  Clément Pit-Claudel <cpitclaudel@gmail.com>,
>  monnier@iro.umontreal.ca, emacs-devel <emacs-devel@gnu.org>
> 
> > What, exactly, will the buffer-text fetch code do if tree-sitter
> > violates the narrowing (by some error in tree-sitter or user code)?
> > throw an exception? return a null string?
> 
> In my current implementation, if tree-sitter access buffer content outside of narrowed region, it reads whitespaces, if it access buffer content outside of the buffer, it reads null string. Neither case throws an error.

What does TS expect the reader function to return when it hits the
beginning or end of buffer text?  I think we should behave the same
when it tries to go beyond the accessible portion.  There should be no
difference between going beyond the restriction and going beyond EOB.




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-28 16:43                                                                       ` Eli Zaretskii
@ 2021-07-28 17:47                                                                         ` Yuan Fu
  2021-07-28 17:54                                                                           ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-07-28 17:47 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: cpitclaudel, Stephen Leake, monnier, emacs-devel



> On Jul 28, 2021, at 12:43 PM, Eli Zaretskii <eliz@gnu.org> wrote:
> 
>> From: Yuan Fu <casouri@gmail.com>
>> Date: Wed, 28 Jul 2021 12:36:33 -0400
>> Cc: Eli Zaretskii <eliz@gnu.org>,
>> Clément Pit-Claudel <cpitclaudel@gmail.com>,
>> monnier@iro.umontreal.ca, emacs-devel <emacs-devel@gnu.org>
>> 
>>> What, exactly, will the buffer-text fetch code do if tree-sitter
>>> violates the narrowing (by some error in tree-sitter or user code)?
>>> throw an exception? return a null string?
>> 
>> In my current implementation, if tree-sitter access buffer content outside of narrowed region, it reads whitespaces, if it access buffer content outside of the buffer, it reads null string. Neither case throws an error.
> 
> What does TS expect the reader function to return when it hits the
> beginning or end of buffer text?  I think we should behave the same
> when it tries to go beyond the accessible portion.  There should be no
> difference between going beyond the restriction and going beyond EOB.
> 

It expect the read function set *read_bytes to 0 when it reached the end of the buffer. Tree-sitter never “hit the beginning of the buffer text” because it doesn’t read backward. I’m pretty sure tree-sitter expects to always be able to read from BOB.

Could you describe the desired effect on tree-sitter when the buffer is narrowed? If we just deny accessibility of the hidden region from tree-sitter, tree-sitter is still aware of the hidden text, because it has previously parsed the hidden text and stored the result in the parse tree.

My current implementation is to “replace” the hidden region with whitespaces. When the buffer is narrowed and tree-sitter is asked to re-parse (by some user command), I tell tree-sitter that the hidden portion of the buffer has changed, then during the re-parse, tree-sitter will re-scan those parts, and reads whitespaces.

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-28 17:47                                                                         ` Yuan Fu
@ 2021-07-28 17:54                                                                           ` Eli Zaretskii
  2021-07-28 18:46                                                                             ` Yuan Fu
  2021-07-29 23:01                                                                             ` Stephen Leake
  0 siblings, 2 replies; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-28 17:54 UTC (permalink / raw)
  To: Yuan Fu; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Wed, 28 Jul 2021 13:47:42 -0400
> Cc: Stephen Leake <stephen_leake@stephe-leake.org>,
>  cpitclaudel@gmail.com,
>  monnier@iro.umontreal.ca,
>  emacs-devel@gnu.org
> 
> Could you describe the desired effect on tree-sitter when the buffer is narrowed?

The behavior should be the same as if the text before and after the
narrowed region didn't exist.

If we just deny accessibility of the hidden region from tree-sitter, tree-sitter is still aware of the hidden text, because it has previously parsed the hidden text and stored the result in the parse tree.

The adherence to narrowing is for the use cases where TS is _always_
invoked on the same narrowed region.  You seem to be thinking about
changes in the narrowing while TS is parsing, or between consecutive
re-parsing calls, but I see no interesting/important use cases which
would need to do that.  And if there are some tricky cases which do
need this, the respective Lisp programs will have to deal with the
problem.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-28 17:54                                                                           ` Eli Zaretskii
@ 2021-07-28 18:46                                                                             ` Yuan Fu
  2021-07-28 19:00                                                                               ` Eli Zaretskii
                                                                                                 ` (2 more replies)
  2021-07-29 23:01                                                                             ` Stephen Leake
  1 sibling, 3 replies; 284+ messages in thread
From: Yuan Fu @ 2021-07-28 18:46 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: cpitclaudel, Stephen Leake, monnier, emacs-devel



> On Jul 28, 2021, at 1:54 PM, Eli Zaretskii <eliz@gnu.org> wrote:
> 
>> From: Yuan Fu <casouri@gmail.com>
>> Date: Wed, 28 Jul 2021 13:47:42 -0400
>> Cc: Stephen Leake <stephen_leake@stephe-leake.org>,
>> cpitclaudel@gmail.com,
>> monnier@iro.umontreal.ca,
>> emacs-devel@gnu.org
>> 
>> Could you describe the desired effect on tree-sitter when the buffer is narrowed?
> 
> The behavior should be the same as if the text before and after the
> narrowed region didn't exist.
> 
> If we just deny accessibility of the hidden region from tree-sitter, tree-sitter is still aware of the hidden text, because it has previously parsed the hidden text and stored the result in the parse tree.
> 
> The adherence to narrowing is for the use cases where TS is _always_
> invoked on the same narrowed region.  You seem to be thinking about
> changes in the narrowing while TS is parsing, or between consecutive
> re-parsing calls, but I see no interesting/important use cases which
> would need to do that.  And if there are some tricky cases which do
> need this, the respective Lisp programs will have to deal with the
> problem.

That makes sense. However it bring up a problem. Consider such a buffer: XXAAXX. Say lisp narrows to AA and creates a tree-sitter parser. Then lisp widens the buffer, and user inserts B in front of AA. Now the buffer is XXBAAXX. Emacs has two options to convey this change to the tree-sitter parser: 1) it does not, then tree-sitter still thinks the buffer is AA, essentially the portion where tree-sitter sees is pushed forward by one character, 2) it tells tree-sitter the user inserted a character at the beginning, then tree-sitter thinks the buffer is BAA. Which option is correct depends on how does lisp later narrows: if lisp narrows to AA, then option 1 is correct, if lisp narrows to BAA, then option 2 is correct. But how do we know which option is correct before lisp narrows?

Yuan




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-28 18:46                                                                             ` Yuan Fu
@ 2021-07-28 19:00                                                                               ` Eli Zaretskii
  2021-07-29 14:35                                                                                 ` Yuan Fu
  2021-07-29 23:06                                                                               ` How to add pseudo vector types Stephen Leake
  2021-07-30  0:35                                                                               ` Richard Stallman
  2 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-28 19:00 UTC (permalink / raw)
  To: Yuan Fu; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Wed, 28 Jul 2021 14:46:03 -0400
> Cc: Stephen Leake <stephen_leake@stephe-leake.org>,
>  cpitclaudel@gmail.com,
>  monnier@iro.umontreal.ca,
>  emacs-devel@gnu.org
> 
> > The adherence to narrowing is for the use cases where TS is _always_
> > invoked on the same narrowed region.  You seem to be thinking about
> > changes in the narrowing while TS is parsing, or between consecutive
> > re-parsing calls, but I see no interesting/important use cases which
> > would need to do that.  And if there are some tricky cases which do
> > need this, the respective Lisp programs will have to deal with the
> > problem.
> 
> That makes sense. However it bring up a problem. Consider such a buffer: XXAAXX. Say lisp narrows to AA and creates a tree-sitter parser. Then lisp widens the buffer, and user inserts B in front of AA. Now the buffer is XXBAAXX. Emacs has two options to convey this change to the tree-sitter parser: 1) it does not, then tree-sitter still thinks the buffer is AA, essentially the portion where tree-sitter sees is pushed forward by one character, 2) it tells tree-sitter the user inserted a character at the beginning, then tree-sitter thinks the buffer is BAA. Which option is correct depends on how does lisp later narrows: if lisp narrows to AA, then option 1 is correct, if lisp narrows to BAA, then option 2 is correct. But how do we know which option is correct before lisp narrows?

We don't need to know.  The Lisp program which needs to handle this
situation will have to figure out what is right in that case, "right"
in the sense that it produces the desired results after communicating
the changes to TS.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-28 19:00                                                                               ` Eli Zaretskii
@ 2021-07-29 14:35                                                                                 ` Yuan Fu
  2021-07-29 15:28                                                                                   ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-07-29 14:35 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: Clément Pit-Claudel, Stephen Leake, Stefan Monnier, emacs-devel

[-- Attachment #1: Type: text/plain, Size: 1728 bytes --]

>> 
>> That makes sense. However it bring up a problem. Consider such a buffer: XXAAXX. Say lisp narrows to AA and creates a tree-sitter parser. Then lisp widens the buffer, and user inserts B in front of AA. Now the buffer is XXBAAXX. Emacs has two options to convey this change to the tree-sitter parser: 1) it does not, then tree-sitter still thinks the buffer is AA, essentially the portion where tree-sitter sees is pushed forward by one character, 2) it tells tree-sitter the user inserted a character at the beginning, then tree-sitter thinks the buffer is BAA. Which option is correct depends on how does lisp later narrows: if lisp narrows to AA, then option 1 is correct, if lisp narrows to BAA, then option 2 is correct. But how do we know which option is correct before lisp narrows?
> 
> We don't need to know.  The Lisp program which needs to handle this
> situation will have to figure out what is right in that case, "right"
> in the sense that it produces the desired results after communicating
> the changes to TS.

The difficulty is that what tree-sitter sees must be consistent. If Emacs updates tree-sitter with option 1 and lisp later choose option 2, the content that tree-sitter sees is not consistent. Anyway, I found a way that avoids this issue: the bounds of tree-sitter’s visible region never changes, and the next time when lisp narrows to a different region, we update tree-sitter’s bound to match that of the narrowing. Here is the latest patch. If the code is not entirely straightforward, I’m happy to add more comment to explain it.

I set up a linux machine and tried to debug the crashing problem, but it didn’t crash. Seems the crash only appears on my Mac...

Yuan


[-- Attachment #2: ts.5.patch --]
[-- Type: application/octet-stream, Size: 23721 bytes --]

From 62fc019a7f57119329d53b9b8a3e8b5c1e61b27f Mon Sep 17 00:00:00 2001
From: Yuan Fu <casouri@gmail.com>
Date: Wed, 28 Jul 2021 21:08:43 -0400
Subject: [PATCH] checkpoint 5

- Move define_error out of json.c
- Add narrowing support
---
 lisp/tree-sitter.el           |  11 +-
 src/eval.c                    |  13 ++
 src/json.c                    |  16 ---
 src/lisp.h                    |   5 +
 src/tree_sitter.c             | 231 +++++++++++++++++++++++-----------
 src/tree_sitter.h             |  15 ++-
 test/src/tree-sitter-tests.el |  53 ++++++++
 7 files changed, 251 insertions(+), 93 deletions(-)

diff --git a/lisp/tree-sitter.el b/lisp/tree-sitter.el
index a6ecb09386..8a887bb406 100644
--- a/lisp/tree-sitter.el
+++ b/lisp/tree-sitter.el
@@ -102,12 +102,13 @@ tree-sitter-font-lock-settings
 
 PATTERN is a tree-sitter query pattern. (See manual for how to
 write query patterns.)  This pattern should capture nodes with
-either face names or function names.  If captured with a face
-name, the node's corresponding text in the buffer is fontified
-with that face; if captured with a function name, the function is
-called with three arguments, BEG END NODE, where BEG and END
+either face symbols or function symbols.  If captured with a face
+symbol, the node's corresponding text in the buffer is fontified
+with that face; if captured with a function symbol, the function
+is called with three arguments, BEG END NODE, where BEG and END
 marks the span of the corresponding text, and NODE is the node
-itself.")
+itself.  If a symbol is both a face and a function, it is treated
+as a face.")
 
 (defun tree-sitter-fontify-region-function (beg end &optional verbose)
   "Fontify the region between BEG and END.
diff --git a/src/eval.c b/src/eval.c
index 18faa0b9b1..33c0763f38 100644
--- a/src/eval.c
+++ b/src/eval.c
@@ -1956,6 +1956,19 @@ signal_error (const char *s, Lisp_Object arg)
   xsignal (Qerror, Fcons (build_string (s), arg));
 }
 
+void
+define_error (Lisp_Object name, const char *message, Lisp_Object parent)
+{
+  eassert (SYMBOLP (name));
+  eassert (SYMBOLP (parent));
+  Lisp_Object parent_conditions = Fget (parent, Qerror_conditions);
+  eassert (CONSP (parent_conditions));
+  eassert (!NILP (Fmemq (parent, parent_conditions)));
+  eassert (NILP (Fmemq (name, parent_conditions)));
+  Fput (name, Qerror_conditions, pure_cons (name, parent_conditions));
+  Fput (name, Qerror_message, build_pure_c_string (message));
+}
+
 /* Use this for arithmetic overflow, e.g., when an integer result is
    too large even for a bignum.  */
 void
diff --git a/src/json.c b/src/json.c
index 3f1d27ad7f..ff28143a3c 100644
--- a/src/json.c
+++ b/src/json.c
@@ -1098,22 +1098,6 @@ DEFUN ("json-parse-buffer", Fjson_parse_buffer, Sjson_parse_buffer,
   return unbind_to (count, lisp);
 }
 
-/* Simplified version of 'define-error' that works with pure
-   objects.  */
-
-static void
-define_error (Lisp_Object name, const char *message, Lisp_Object parent)
-{
-  eassert (SYMBOLP (name));
-  eassert (SYMBOLP (parent));
-  Lisp_Object parent_conditions = Fget (parent, Qerror_conditions);
-  eassert (CONSP (parent_conditions));
-  eassert (!NILP (Fmemq (parent, parent_conditions)));
-  eassert (NILP (Fmemq (name, parent_conditions)));
-  Fput (name, Qerror_conditions, pure_cons (name, parent_conditions));
-  Fput (name, Qerror_message, build_pure_c_string (message));
-}
-
 void
 syms_of_json (void)
 {
diff --git a/src/lisp.h b/src/lisp.h
index e439447283..d30509b61a 100644
--- a/src/lisp.h
+++ b/src/lisp.h
@@ -5127,6 +5127,11 @@ maybe_gc (void)
     maybe_garbage_collect ();
 }
 
+/* Simplified version of 'define-error' that works with pure
+   objects.  */
+void
+define_error (Lisp_Object name, const char *message, Lisp_Object parent);
+
 INLINE_HEADER_END
 
 #endif /* EMACS_LISP_H */
diff --git a/src/tree_sitter.c b/src/tree_sitter.c
index e9f8ddc7e3..5e16df7758 100644
--- a/src/tree_sitter.c
+++ b/src/tree_sitter.c
@@ -19,17 +19,8 @@ Copyright (C) 2021 Free Software Foundation, Inc.
 
 #include <config.h>
 
-#include <sys/types.h>
-#include <sys/stat.h>
-#include <sys/param.h>
-#include <errno.h>
-#include <stdio.h>
-#include <stdlib.h>
-#include <unistd.h>
-
 #include "lisp.h"
 #include "buffer.h"
-#include "coding.h"
 #include "tree_sitter.h"
 
 /* parser.h defines a macro ADVANCE that conflicts with alloc.c.  */
@@ -61,6 +52,16 @@ DEFUN ("tree-sitter-node-p",
 
 /*** Parsing functions */
 
+static inline void
+ts_tree_edit_1 (TSTree *tree, ptrdiff_t start_byte,
+		ptrdiff_t old_end_byte, ptrdiff_t new_end_byte)
+{
+  TSPoint dummy_point = {0, 0};
+  TSInputEdit edit = {start_byte, old_end_byte, new_end_byte,
+		      dummy_point, dummy_point, dummy_point};
+  ts_tree_edit (tree, &edit);
+}
+
 /* Update each parser's tree after the user made an edit.  This
 function does not parse the buffer and only updates the tree. (So it
 should be very fast.)  */
@@ -68,18 +69,38 @@ DEFUN ("tree-sitter-node-p",
 ts_record_change (ptrdiff_t start_byte, ptrdiff_t old_end_byte,
 		  ptrdiff_t new_end_byte)
 {
+  eassert(start_byte <= old_end_byte);
+  eassert(start_byte <= new_end_byte);
+
   Lisp_Object parser_list = Fsymbol_value (Qtree_sitter_parser_list);
-  TSPoint dummy_point = {0, 0};
-  TSInputEdit edit = {start_byte, old_end_byte, new_end_byte,
-		      dummy_point, dummy_point, dummy_point};
+
   while (!NILP (parser_list))
     {
       Lisp_Object lisp_parser = Fcar (parser_list);
       TSTree *tree = XTS_PARSER (lisp_parser)->tree;
       if (tree != NULL)
-	ts_tree_edit (tree, &edit);
-      XTS_PARSER (lisp_parser)->need_reparse = true;
-      parser_list = Fcdr (parser_list);
+	{
+	  /* We "clip" the change to between visible_beg and
+	     visible_end.  It is okay if visible_end ends up larger
+	     than BUF_Z, tree-sitter only access buffer text during
+	     re-parse, and we will adjust visible_beg/end before
+	     re-parse.  */
+	  ptrdiff_t visible_beg = XTS_PARSER (lisp_parser)->visible_beg;
+	  ptrdiff_t visible_end = XTS_PARSER (lisp_parser)->visible_end;
+
+	  ptrdiff_t visible_start =
+	    max (visible_beg, start_byte) - visible_beg;
+	  ptrdiff_t visible_old_end =
+	    min (visible_end, old_end_byte) - visible_beg;
+	  ptrdiff_t visible_new_end =
+	    min (visible_end, new_end_byte) - visible_beg;
+
+	  ts_tree_edit_1 (tree, visible_start, visible_old_end,
+			  visible_new_end);
+	  XTS_PARSER (lisp_parser)->need_reparse = true;
+
+	  parser_list = Fcdr (parser_list);
+	}
     }
 }
 
@@ -93,16 +114,67 @@ ts_ensure_parsed (Lisp_Object parser)
   TSParser *ts_parser = XTS_PARSER (parser)->parser;
   TSTree *tree = XTS_PARSER(parser)->tree;
   TSInput input = XTS_PARSER (parser)->input;
+  struct buffer *buffer = XTS_PARSER (parser)->buffer;
+
+  /* Before we parse, catch up with the narrowing situation.  We
+     change visible_beg and visible_end to match BUF_BEGV_BYTE and
+     BUF_ZV_BYTE, and inform tree-sitter of the change.  */
+  ptrdiff_t visible_beg = XTS_PARSER (parser)->visible_beg;
+  ptrdiff_t visible_end = XTS_PARSER (parser)->visible_end;
+  /* Before re-parse, we want to move the visible range of tree-sitter
+     to matched the narrowed range. For example:
+     Move ________|____|__
+     to   |____|__________ */
+
+  /* 1. Make sure visible_beg <= BUF_BEGV_BYTE.  */
+  if (visible_beg > BUF_BEGV_BYTE (buffer))
+    {
+      /* Tree-sitter sees: insert at the beginning. */
+      ts_tree_edit_1 (tree, 0, 0, visible_beg - BUF_BEGV_BYTE (buffer));
+      visible_beg = BUF_BEGV_BYTE (buffer);
+    }
+  /* 2. Make sure visible_end = BUF_ZV_BYTE.  */
+  if (visible_end < BUF_ZV_BYTE (buffer))
+    {
+      /* Tree-sitter sees: insert at the end.  */
+      ts_tree_edit_1 (tree, visible_end - visible_beg,
+		      visible_end - visible_beg,
+		      BUF_ZV_BYTE (buffer) - visible_beg);
+      visible_end = BUF_ZV_BYTE (buffer);
+    }
+  else if (visible_end > BUF_ZV_BYTE (buffer))
+    {
+      /* Tree-sitter sees: delete at the end.  */
+      ts_tree_edit_1 (tree, BUF_ZV_BYTE (buffer) - visible_beg,
+		      visible_end - visible_beg,
+		      BUF_ZV_BYTE (buffer) - visible_beg);
+      visible_end = BUF_ZV_BYTE (buffer);
+    }
+  /* 3. Make sure visible_beg = BUF_BEGV_BYTE.  */
+  if (visible_beg < BUF_BEGV_BYTE (buffer))
+    {
+      /* Tree-sitter sees: delete at the beginning.  */
+      ts_tree_edit_1 (tree, 0, BUF_BEGV_BYTE (buffer) - visible_beg, 0);
+      visible_beg = BUF_BEGV_BYTE (buffer);
+    }
+  XTS_PARSER (parser)->visible_beg = visible_beg;
+  XTS_PARSER (parser)->visible_end = visible_end;
+
   TSTree *new_tree = ts_parser_parse(ts_parser, tree, input);
-  /* This should be very rare: it only happens when 1) language is not
-     set (impossible in Emacs because the user has to supply a
-     language to create a parser), 2) parse canceled due to timeout
-     (impossible because we don't set a timeout), 3) parse canceled
-     due to cancellation flag (impossible because we don't set the
-     flag).  (See comments for ts_parser_parse in
+  /* This should be very rare (impossible, really): it only happens
+     when 1) language is not set (impossible in Emacs because the user
+     has to supply a language to create a parser), 2) parse canceled
+     due to timeout (impossible because we don't set a timeout), 3)
+     parse canceled due to cancellation flag (impossible because we
+     don't set the flag).  (See comments for ts_parser_parse in
      tree_sitter/api.h.)  */
   if (new_tree == NULL)
-    signal_error ("Parse failed", parser);
+    {
+      Lisp_Object buf;
+      XSETBUFFER(buf, buffer);
+      xsignal1 (Qtree_sitter_parse_error, buf);
+    }
+
   ts_tree_delete (tree);
   XTS_PARSER (parser)->tree = new_tree;
   XTS_PARSER (parser)->need_reparse = false;
@@ -110,13 +182,18 @@ ts_ensure_parsed (Lisp_Object parser)
 }
 
 /* This is the read function provided to tree-sitter to read from a
-   buffer.  It reads one character at a time and automatically skip
+   buffer.  It reads one character at a time and automatically skips
    the gap.  */
 const char*
-ts_read_buffer (void *buffer, uint32_t byte_index,
+ts_read_buffer (void *parser, uint32_t byte_index,
 		TSPoint position, uint32_t *bytes_read)
 {
-  ptrdiff_t byte_pos = byte_index + 1;
+  struct buffer *buffer = ((struct Lisp_TS_Parser *) parser)->buffer;
+  ptrdiff_t visible_beg = ((struct Lisp_TS_Parser *) parser)->visible_beg;
+  ptrdiff_t byte_pos = byte_index + visible_beg;
+  /* We will make sure visible_beg >= BUF_BEG_BYTE before re-parse (in
+     ts_ensure_parsed), so byte_pos will never be smaller than
+     BUF_BEG_BYTE (unless byte_index < 0).  */
 
   /* Read one character.  Tree-sitter wants us to set bytes_read to 0
      if it reads to the end of buffer.  It doesn't say what it wants
@@ -126,26 +203,26 @@ ts_read_buffer (void *buffer, uint32_t byte_index,
   int len;
   /* This function could run from a user command, so it is better to
      do nothing instead of raising an error. (It was a pain in the a**
-     to read mega-if-conditions in Emacs source, so I write the two
-     branches separately, hoping the compiler can merge them.)  */
-  if (!BUFFER_LIVE_P ((struct buffer *) buffer))
+     to decrypt mega-if-conditions in Emacs source, so I wrote the two
+     branches separately.)  */
+  if (!BUFFER_LIVE_P (buffer))
     {
       beg = "";
       len = 0;
     }
-  // TODO BUF_ZV_BYTE?
-  else if (byte_pos >= BUF_Z_BYTE ((struct buffer *) buffer))
+  /* Reached visible end-of-buffer, tell tree-sitter to read no more.  */
+  else if (byte_pos >= BUF_ZV_BYTE (buffer))
     {
       beg = "";
       len = 0;
     }
+  /* Normal case, read a character.  */
   else
     {
       beg = (char *) BUF_BYTE_ADDRESS (buffer, byte_pos);
-      len = BYTES_BY_CHAR_HEAD ((int) beg);
+      len = BYTES_BY_CHAR_HEAD ((int) *beg);
     }
   *bytes_read = (uint32_t) len;
-
   return beg;
 }
 
@@ -158,13 +235,16 @@ make_ts_parser (struct buffer *buffer, TSParser *parser,
 {
   struct Lisp_TS_Parser *lisp_parser
     = ALLOCATE_PSEUDOVECTOR (struct Lisp_TS_Parser, name, PVEC_TS_PARSER);
+
   lisp_parser->name = name;
   lisp_parser->buffer = buffer;
   lisp_parser->parser = parser;
   lisp_parser->tree = tree;
-  TSInput input = {buffer, ts_read_buffer, TSInputEncodingUTF8};
+  TSInput input = {lisp_parser, ts_read_buffer, TSInputEncodingUTF8};
   lisp_parser->input = input;
   lisp_parser->need_reparse = true;
+  lisp_parser->visible_beg = BUF_BEGV (buffer);
+  lisp_parser->visible_end = BUF_ZV (buffer);
   return make_lisp_ptr (lisp_parser, Lisp_Vectorlike);
 }
 
@@ -287,7 +367,7 @@ DEFUN ("tree-sitter-parse-string",
   /* See comment in ts_ensure_parsed for possible reasons for a
      failure.  */
   if (tree == NULL)
-    signal_error ("Failed to parse STRING", string);
+    xsignal1 (Qtree_sitter_parse_error, string);
 
   TSNode root_node = ts_tree_root_node (tree);
 
@@ -535,7 +615,9 @@ DEFUN ("tree-sitter-node-first-child-for-byte",
 {
   CHECK_INTEGER (pos);
 
-  struct buffer *buf = (XTS_PARSER (XTS_NODE (node)->parser)->buffer);
+  struct buffer *buf = XTS_PARSER (XTS_NODE (node)->parser)->buffer;
+  ptrdiff_t visible_beg =
+    XTS_PARSER (XTS_NODE (node)->parser)->visible_beg;
   ptrdiff_t byte_pos = XFIXNUM (pos);
 
   if (byte_pos < BUF_BEGV_BYTE (buf) || byte_pos > BUF_ZV_BYTE (buf))
@@ -544,9 +626,10 @@ DEFUN ("tree-sitter-node-first-child-for-byte",
   TSNode ts_node = XTS_NODE (node)->node;
   TSNode child;
   if (NILP (named))
-    child = ts_node_first_child_for_byte (ts_node, byte_pos - 1);
+    child = ts_node_first_child_for_byte (ts_node, byte_pos - visible_beg);
   else
-    child = ts_node_first_named_child_for_byte (ts_node, byte_pos - 1);
+    child = ts_node_first_named_child_for_byte
+      (ts_node, byte_pos - visible_beg);
 
   if (ts_node_is_null(child))
     return Qnil;
@@ -566,7 +649,9 @@ DEFUN ("tree-sitter-node-descendant-for-byte-range",
   CHECK_INTEGER (beg);
   CHECK_INTEGER (end);
 
-  struct buffer *buf = (XTS_PARSER (XTS_NODE (node)->parser)->buffer);
+  struct buffer *buf = XTS_PARSER (XTS_NODE (node)->parser)->buffer;
+  ptrdiff_t visible_beg =
+    XTS_PARSER (XTS_NODE (node)->parser)->visible_beg;
   ptrdiff_t byte_beg = XFIXNUM (beg);
   ptrdiff_t byte_end = XFIXNUM (end);
 
@@ -580,10 +665,10 @@ DEFUN ("tree-sitter-node-descendant-for-byte-range",
   TSNode child;
   if (NILP (named))
     child = ts_node_descendant_for_byte_range
-      (ts_node, byte_beg - 1 , byte_end - 1);
+      (ts_node, byte_beg - visible_beg , byte_end - visible_beg);
   else
     child = ts_node_named_descendant_for_byte_range
-      (ts_node, byte_beg - 1, byte_end - 1);
+      (ts_node, byte_beg - visible_beg, byte_end - visible_beg);
 
   if (ts_node_is_null(child))
     return Qnil;
@@ -593,31 +678,24 @@ DEFUN ("tree-sitter-node-descendant-for-byte-range",
 
 /* Query functions */
 
-Lisp_Object ts_query_error_to_string (TSQueryError error)
+char*
+ts_query_error_to_string (TSQueryError error)
 {
-  char *error_name;
   switch (error)
     {
     case TSQueryErrorNone:
-      error_name = "none";
-      break;
+      return "none";
     case TSQueryErrorSyntax:
-      error_name = "syntax";
-      break;
+      return "syntax";
     case TSQueryErrorNodeType:
-      error_name = "node type";
-      break;
+      return "node type";
     case TSQueryErrorField:
-      error_name = "field";
-      break;
+      return "field";
     case TSQueryErrorCapture:
-      error_name = "capture";
-      break;
+      return "capture";
     case TSQueryErrorStructure:
-      error_name = "structure";
-      break;
+      return "structure";
     }
-  return  make_pure_c_string (error_name, strlen(error_name));
 }
 
 DEFUN ("tree-sitter-query-capture",
@@ -634,7 +712,7 @@ DEFUN ("tree-sitter-query-capture",
 BEG and END, if _both_ non-nil, specifies the range in which the query
 is executed.
 
-Return nil if the query failed.  */)
+Raise an tree-sitter-query-error if PATTERN is malformed.  */)
   (Lisp_Object node, Lisp_Object pattern,
    Lisp_Object beg, Lisp_Object end)
 {
@@ -643,47 +721,56 @@ DEFUN ("tree-sitter-query-capture",
 
   TSNode ts_node = XTS_NODE (node)->node;
   Lisp_Object lisp_parser = XTS_NODE (node)->parser;
+  ptrdiff_t visible_beg =
+    XTS_PARSER (XTS_NODE (node)->parser)->visible_beg;
   const TSLanguage *lang = ts_parser_language
     (XTS_PARSER (lisp_parser)->parser);
   char *source = SSDATA (pattern);
 
+
   uint32_t error_offset;
-  uint32_t error_type;
+  TSQueryError error_type;
   TSQuery *query = ts_query_new (lang, source, strlen (source),
 				 &error_offset, &error_type);
   TSQueryCursor *cursor = ts_query_cursor_new ();
 
   if (query == NULL)
     {
-      // FIXME: Signal an error?
-      return Qnil;
+      // FIXME: Still crashes, debug when I can get a gdb.
+      xsignal2 (Qtree_sitter_query_error,
+		make_fixnum (error_offset),
+		build_string (ts_query_error_to_string (error_type)));
     }
   if (!NILP (beg) && !NILP (end))
     {
       EMACS_INT beg_byte = XFIXNUM (beg);
       EMACS_INT end_byte = XFIXNUM (end);
       ts_query_cursor_set_byte_range
-	(cursor, (uint32_t) beg_byte - 1, (uint32_t) end_byte - 1);
+	(cursor, (uint32_t) beg_byte - visible_beg,
+	 (uint32_t) end_byte - visible_beg);
     }
 
   ts_query_cursor_exec (cursor, query, ts_node);
   TSQueryMatch match;
-  TSQueryCapture capture;
+
   Lisp_Object result = Qnil;
-  Lisp_Object entry;
-  Lisp_Object captured_node;
-  const char *capture_name;
-  uint32_t capture_name_len;
   while (ts_query_cursor_next_match (cursor, &match))
     {
       const TSQueryCapture *captures = match.captures;
       for (int idx=0; idx < match.capture_count; idx++)
 	{
+	  TSQueryCapture capture;
+	  Lisp_Object captured_node;
+	  const char *capture_name;
+	  Lisp_Object entry;
+	  uint32_t capture_name_len;
+
 	  capture = captures[idx];
 	  captured_node = make_ts_node(lisp_parser, capture.node);
 	  capture_name = ts_query_capture_name_for_id
 	    (query, capture.index, &capture_name_len);
-	  entry = Fcons (intern_c_string (capture_name),
+	  entry = Fcons (intern_c_string_1
+			 (capture_name, capture_name_len),
 			 captured_node);
 	  result = Fcons (entry, result);
 	}
@@ -705,11 +792,15 @@ syms_of_tree_sitter (void)
   DEFSYM (Qhas_changes, "has-changes");
   DEFSYM (Qhas_error, "has-error");
 
+  DEFSYM(Qtree_sitter_error, "tree-sitter-error");
   DEFSYM (Qtree_sitter_query_error, "tree-sitter-query-error");
-  Fput (Qtree_sitter_query_error, Qerror_conditions,
-	pure_list (Qtree_sitter_query_error, Qerror));
-  Fput (Qtree_sitter_query_error, Qerror_message,
-	build_pure_c_string ("Error with query pattern"))
+  DEFSYM (Qtree_sitter_parse_error, "tree-sitter-parse-error")
+  define_error (Qtree_sitter_error, "Generic tree-sitter error", Qerror);
+  define_error (Qtree_sitter_query_error, "Query pattern is malformed",
+		Qtree_sitter_error);
+  define_error (Qtree_sitter_parse_error, "Parse failed",
+		Qtree_sitter_error);
+
 
   DEFSYM (Qtree_sitter_parser_list, "tree-sitter-parser-list");
   DEFVAR_LISP ("tree-sitter-parser-list", Vtree_sitter_parser_list,
diff --git a/src/tree_sitter.h b/src/tree_sitter.h
index e9b4a71326..7e0fec0ee9 100644
--- a/src/tree_sitter.h
+++ b/src/tree_sitter.h
@@ -20,8 +20,6 @@ Copyright (C) 2021 Free Software Foundation, Inc.
 #ifndef EMACS_TREE_SITTER_H
 #define EMACS_TREE_SITTER_H
 
-#include <sys/types.h>
-
 #include "lisp.h"
 
 #include <tree_sitter/api.h>
@@ -33,12 +31,25 @@ #define EMACS_TREE_SITTER_H
 struct Lisp_TS_Parser
 {
   union vectorlike_header header;
+  /* A parser's name is just a convenient tag, see docstring for
+     'tree-sitter-make-parser', and 'tree-sitter-get-parser'. */
   Lisp_Object name;
   struct buffer *buffer;
   TSParser *parser;
   TSTree *tree;
   TSInput input;
+  /* Re-parsing an unchanged buffer is not free for tree-sitter, so we
+     only make it re-parse when need_reparse == true.  That usually
+     means some change is made in the buffer.  But others could set
+     this field to true to force tree-sitter to re-parse.  */
   bool need_reparse;
+  /* This two positions record the byte position of the "visible
+     region" that tree-sitter sees.  Unlike markers, These two
+     positions do not change as the user inserts and deletes text
+     around them. Before re-parse, we move these positions to match
+     BUF_BEGV_BYTE and BUF_ZV_BYTE.  */
+  ptrdiff_t visible_beg;
+  ptrdiff_t visible_end;
 };
 
 /* A wrapper around a tree-sitter node.  */
diff --git a/test/src/tree-sitter-tests.el b/test/src/tree-sitter-tests.el
index c61ad678d2..69104568de 100644
--- a/test/src/tree-sitter-tests.el
+++ b/test/src/tree-sitter-tests.el
@@ -148,5 +148,58 @@ tree-sitter-query-api
                          (cdr entry))))
                 (tree-sitter-query-capture root-node pattern)))))))
 
+(ert-deftest tree-sitter-narrow ()
+  "Tests if narrowing works."
+  (with-temp-buffer
+    (let (parser root-node pattern doc-node object-node pair-node)
+      (progn
+        (insert "xxx[1,{\"name\": \"Bob\"},2,3]xxx")
+        (narrow-to-region (+ (point-min) 3) (- (point-max) 3))
+        (setq parser (tree-sitter-create-parser
+                      (current-buffer) (tree-sitter-json)))
+        (setq root-node (tree-sitter-parser-root-node
+                         parser)))
+      ;; This test is from the basic test.
+      (should
+       (equal
+        (tree-sitter-node-string
+         (tree-sitter-parser-root-node parser))
+        "(document (array (number) (object (pair key: (string (string_content)) value: (string (string_content)))) (number) (number)))"))
+
+      (widen)
+      (goto-char (point-min))
+      (insert "ooo")
+      (should (equal "oooxxx[1,{\"name\": \"Bob\"},2,3]xxx"
+                     (buffer-string)))
+      (delete-region 10 26)
+      (should (equal "oooxxx[1,2,3]xxx"
+                     (buffer-string)))
+      (narrow-to-region (+ (point-min) 6) (- (point-max) 3))
+      ;; This test is also from the basic test.
+      (should
+       (equal (tree-sitter-node-string
+               (tree-sitter-parser-root-node parser))
+              "(document (array (number) (number) (number)))"))
+      (widen)
+      (goto-char (point-max))
+      (insert "[1,2]")
+      (should (equal "oooxxx[1,2,3]xxx[1,2]"
+                     (buffer-string)))
+      (narrow-to-region (- (point-max) 5) (point-max))
+      (should
+       (equal (tree-sitter-node-string
+               (tree-sitter-parser-root-node parser))
+              "(document (array (number) (number)))"))
+      (widen)
+      (goto-char (point-min))
+      (insert "[1]")
+      (should (equal "[1]oooxxx[1,2,3]xxx[1,2]"
+                     (buffer-string)))
+      (narrow-to-region (point-min) (+ (point-min) 3))
+      (should
+       (equal (tree-sitter-node-string
+               (tree-sitter-parser-root-node parser))
+              "(document (array (number)))")))))
+
 (provide 'tree-sitter-tests)
 ;;; tree-sitter-tests.el ends here
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-29 14:35                                                                                 ` Yuan Fu
@ 2021-07-29 15:28                                                                                   ` Eli Zaretskii
  2021-07-29 15:57                                                                                     ` Yuan Fu
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-29 15:28 UTC (permalink / raw)
  To: Yuan Fu; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Thu, 29 Jul 2021 10:35:10 -0400
> Cc: Stephen Leake <stephen_leake@stephe-leake.org>,
>  Clément Pit-Claudel <cpitclaudel@gmail.com>,
>  Stefan Monnier <monnier@iro.umontreal.ca>,
>  emacs-devel@gnu.org
> 
> > We don't need to know.  The Lisp program which needs to handle this
> > situation will have to figure out what is right in that case, "right"
> > in the sense that it produces the desired results after communicating
> > the changes to TS.
> 
> The difficulty is that what tree-sitter sees must be consistent. If Emacs updates tree-sitter with option 1 and lisp later choose option 2, the content that tree-sitter sees is not consistent.

If that happens, it means the Lisp program which does that has a bug
that needs to be fixed.

> Anyway, I found a way that avoids this issue: the bounds of tree-sitter’s visible region never changes, and the next time when lisp narrows to a different region, we update tree-sitter’s bound to match that of the narrowing. Here is the latest patch. If the code is not entirely straightforward, I’m happy to add more comment to explain it.

I'm not sure we should do this, because it means we second-guess what
the Lisp program calling TS intends to do.  Why should we do that,
instead of leaving it to the Lisp program to DTRT?  And what happens
if our guess is wrong?



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-29 15:28                                                                                   ` Eli Zaretskii
@ 2021-07-29 15:57                                                                                     ` Yuan Fu
  2021-07-29 16:21                                                                                       ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-07-29 15:57 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: Clément Pit-Claudel, Stephen Leake, monnier, emacs-devel



> On Jul 29, 2021, at 11:28 AM, Eli Zaretskii <eliz@gnu.org> wrote:
> 
>> From: Yuan Fu <casouri@gmail.com>
>> Date: Thu, 29 Jul 2021 10:35:10 -0400
>> Cc: Stephen Leake <stephen_leake@stephe-leake.org>,
>> Clément Pit-Claudel <cpitclaudel@gmail.com>,
>> Stefan Monnier <monnier@iro.umontreal.ca>,
>> emacs-devel@gnu.org
>> 
>>> We don't need to know.  The Lisp program which needs to handle this
>>> situation will have to figure out what is right in that case, "right"
>>> in the sense that it produces the desired results after communicating
>>> the changes to TS.
>> 
>> The difficulty is that what tree-sitter sees must be consistent. If Emacs updates tree-sitter with option 1 and lisp later choose option 2, the content that tree-sitter sees is not consistent.
> 
> If that happens, it means the Lisp program which does that has a bug
> that needs to be fixed.
> 
>> Anyway, I found a way that avoids this issue: the bounds of tree-sitter’s visible region never changes, and the next time when lisp narrows to a different region, we update tree-sitter’s bound to match that of the narrowing. Here is the latest patch. If the code is not entirely straightforward, I’m happy to add more comment to explain it.
> 
> I'm not sure we should do this, because it means we second-guess what
> the Lisp program calling TS intends to do.  Why should we do that,
> instead of leaving it to the Lisp program to DTRT?  And what happens
> if our guess is wrong?

I don’t think the current implementation guesses anything. Let me turn around and ask you what is TRT: if the buffer is xxxAAAxxx, and lisp narrows to AAA and creates a parser, parser sees AAA; now widen, user inserts BBB in front of AAA, what do we tell tree-sitter? Nothing changed, or BBB inserted at the beginning? To where should lisp narrow? BBBAAA, or AAA, or BBB?

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-29 15:57                                                                                     ` Yuan Fu
@ 2021-07-29 16:21                                                                                       ` Eli Zaretskii
  2021-07-29 16:59                                                                                         ` Yuan Fu
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-29 16:21 UTC (permalink / raw)
  To: Yuan Fu; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Thu, 29 Jul 2021 11:57:56 -0400
> Cc: Stephen Leake <stephen_leake@stephe-leake.org>,
>  Clément Pit-Claudel <cpitclaudel@gmail.com>,
>  monnier@iro.umontreal.ca,
>  emacs-devel@gnu.org
> 
> >> Anyway, I found a way that avoids this issue: the bounds of tree-sitter’s visible region never changes, and the next time when lisp narrows to a different region, we update tree-sitter’s bound to match that of the narrowing. Here is the latest patch. If the code is not entirely straightforward, I’m happy to add more comment to explain it.
> > 
> > I'm not sure we should do this, because it means we second-guess what
> > the Lisp program calling TS intends to do.  Why should we do that,
> > instead of leaving it to the Lisp program to DTRT?  And what happens
> > if our guess is wrong?
> 
> I don’t think the current implementation guesses anything. Let me turn around and ask you what is TRT: if the buffer is xxxAAAxxx, and lisp narrows to AAA and creates a parser, parser sees AAA; now widen, user inserts BBB in front of AAA, what do we tell tree-sitter? Nothing changed, or BBB inserted at the beginning?

Neither.  We should tell TS that instead of AAA there's now
xxxBBBAAAxxx, because the narrowing was removed.

> To where should lisp narrow? BBBAAA, or AAA, or BBB?

It's the question for the Lisp program, not for the low-level code
which we are discussing.

Anyway, you are once again bothered by a scenario that should not
happen at all: a Lisp program should not call TS first with, then
without narrowing (or the other way around).  I don't see why such
situation should happen, and if they do, the Lisp programs which need
them will have to figure out what to do and how.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-29 16:21                                                                                       ` Eli Zaretskii
@ 2021-07-29 16:59                                                                                         ` Yuan Fu
  2021-07-29 17:38                                                                                           ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-07-29 16:59 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: Clément Pit-Claudel, Stephen Leake, Stefan Monnier, emacs-devel



> On Jul 29, 2021, at 12:21 PM, Eli Zaretskii <eliz@gnu.org> wrote:
> 
>> From: Yuan Fu <casouri@gmail.com>
>> Date: Thu, 29 Jul 2021 11:57:56 -0400
>> Cc: Stephen Leake <stephen_leake@stephe-leake.org>,
>> Clément Pit-Claudel <cpitclaudel@gmail.com>,
>> monnier@iro.umontreal.ca,
>> emacs-devel@gnu.org
>> 
>>>> Anyway, I found a way that avoids this issue: the bounds of tree-sitter’s visible region never changes, and the next time when lisp narrows to a different region, we update tree-sitter’s bound to match that of the narrowing. Here is the latest patch. If the code is not entirely straightforward, I’m happy to add more comment to explain it.
>>> 
>>> I'm not sure we should do this, because it means we second-guess what
>>> the Lisp program calling TS intends to do.  Why should we do that,
>>> instead of leaving it to the Lisp program to DTRT?  And what happens
>>> if our guess is wrong?
>> 
>> I don’t think the current implementation guesses anything. Let me turn around and ask you what is TRT: if the buffer is xxxAAAxxx, and lisp narrows to AAA and creates a parser, parser sees AAA; now widen, user inserts BBB in front of AAA, what do we tell tree-sitter? Nothing changed, or BBB inserted at the beginning?
> 
> Neither.  We should tell TS that instead of AAA there's now
> xxxBBBAAAxxx, because the narrowing was removed.

This is the common usage that I imagined:

Narrow
Calls tree-sitter (for fontification etc)
Widen

Users edit the buffer

narrow
Calls tree-sitter (for fontification etc)
Widen

Ideally, tree-sitter only sees the narrowed region because everytime it is called, the buffer is narrowed. However, tree-sitter doesn’t work that way, it needs to be updated when user edits the buffer, when the buffer is widened. If your goal is give lisp control of what tree-sitter sees, we can’t just give tree-sitter the whole buffer whenever the user makes some change.

> 
>> To where should lisp narrow? BBBAAA, or AAA, or BBB?
> 
> It's the question for the Lisp program, not for the low-level code
> which we are discussing.
> 
> Anyway, you are once again bothered by a scenario that should not
> happen at all: a Lisp program should not call TS first with, then
> without narrowing (or the other way around).  I don't see why such
> situation should happen, and if they do, the Lisp programs which need
> them will have to figure out what to do and how.

Even if lisp always call tree-sitter with narrowing, we still need to update tree-sitter when the buffer is widened. Does that make sense?

Yuan




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-29 16:59                                                                                         ` Yuan Fu
@ 2021-07-29 17:38                                                                                           ` Eli Zaretskii
  2021-07-29 17:55                                                                                             ` Yuan Fu
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-29 17:38 UTC (permalink / raw)
  To: Yuan Fu; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Thu, 29 Jul 2021 12:59:43 -0400
> Cc: Stephen Leake <stephen_leake@stephe-leake.org>,
>  Clément Pit-Claudel <cpitclaudel@gmail.com>,
>  Stefan Monnier <monnier@iro.umontreal.ca>,
>  emacs-devel@gnu.org
> 
> >> I don’t think the current implementation guesses anything. Let me turn around and ask you what is TRT: if the buffer is xxxAAAxxx, and lisp narrows to AAA and creates a parser, parser sees AAA; now widen, user inserts BBB in front of AAA, what do we tell tree-sitter? Nothing changed, or BBB inserted at the beginning?
> > 
> > Neither.  We should tell TS that instead of AAA there's now
> > xxxBBBAAAxxx, because the narrowing was removed.
> 
> This is the common usage that I imagined:
> 
> Narrow
> Calls tree-sitter (for fontification etc)
> Widen
> 
> Users edit the buffer
> 
> narrow
> Calls tree-sitter (for fontification etc)
> Widen
>
> Ideally, tree-sitter only sees the narrowed region because everytime it is called, the buffer is narrowed. However, tree-sitter doesn’t work that way, it needs to be updated when user edits the buffer, when the buffer is widened. If your goal is give lisp control of what tree-sitter sees, we can’t just give tree-sitter the whole buffer whenever the user makes some change.

In the above scenario, then the Lisp program that narrows the buffer
should figure out how to do that correctly.  The call to TS will then
express the changes in the narrowed region only.

> > Anyway, you are once again bothered by a scenario that should not
> > happen at all: a Lisp program should not call TS first with, then
> > without narrowing (or the other way around).  I don't see why such
> > situation should happen, and if they do, the Lisp programs which need
> > them will have to figure out what to do and how.
> 
> Even if lisp always call tree-sitter with narrowing, we still need to update tree-sitter when the buffer is widened.

No, I don't think so.  Why would we need to?  From the TS POV the text
outside the restriction doesn't exist because it never sees it.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-29 17:38                                                                                           ` Eli Zaretskii
@ 2021-07-29 17:55                                                                                             ` Yuan Fu
  2021-07-29 18:37                                                                                               ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-07-29 17:55 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: cpitclaudel, Stephen Leake, monnier, emacs-devel

>> 
>> Ideally, tree-sitter only sees the narrowed region because everytime it is called, the buffer is narrowed. However, tree-sitter doesn’t work that way, it needs to be updated when user edits the buffer, when the buffer is widened. If your goal is give lisp control of what tree-sitter sees, we can’t just give tree-sitter the whole buffer whenever the user makes some change.
> 
> In the above scenario, then the Lisp program that narrows the buffer
> should figure out how to do that correctly.  The call to TS will then
> express the changes in the narrowed region only.
> 
>>> Anyway, you are once again bothered by a scenario that should not
>>> happen at all: a Lisp program should not call TS first with, then
>>> without narrowing (or the other way around).  I don't see why such
>>> situation should happen, and if they do, the Lisp programs which need
>>> them will have to figure out what to do and how.
>> 
>> Even if lisp always call tree-sitter with narrowing, we still need to update tree-sitter when the buffer is widened.
> 
> No, I don't think so.  Why would we need to?  From the TS POV the text
> outside the restriction doesn't exist because it never sees it.

Actually, that sounds like how it works in my code right now. After the last few exchanges, I still have the feeling that we are not on the same page. Could you have a look at the code in ts_ensure_parsed and ts_record_change, and see if it aligns with what you consider to be the right thing? If you have read them already and think you understand what are they doing, could you tell me how exactly should these two functions behave, in your opinion? Thanks.

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-29 17:55                                                                                             ` Yuan Fu
@ 2021-07-29 18:37                                                                                               ` Eli Zaretskii
  2021-07-29 18:57                                                                                                 ` Yuan Fu
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-29 18:37 UTC (permalink / raw)
  To: Yuan Fu; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Thu, 29 Jul 2021 13:55:48 -0400
> Cc: Stephen Leake <stephen_leake@stephe-leake.org>,
>  cpitclaudel@gmail.com,
>  monnier@iro.umontreal.ca,
>  emacs-devel@gnu.org
> 
> Actually, that sounds like how it works in my code right now. After the last few exchanges, I still have the feeling that we are not on the same page. Could you have a look at the code in ts_ensure_parsed and ts_record_change, and see if it aligns with what you consider to be the right thing? If you have read them already and think you understand what are they doing, could you tell me how exactly should these two functions behave, in your opinion? Thanks.

Where do I find the latest version of the code?



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-29 18:37                                                                                               ` Eli Zaretskii
@ 2021-07-29 18:57                                                                                                 ` Yuan Fu
  2021-07-30  6:47                                                                                                   ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-07-29 18:57 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: Clément Pit-Claudel, Stephen Leake, Stefan Monnier, emacs-devel

[-- Attachment #1: Type: text/plain, Size: 925 bytes --]



> On Jul 29, 2021, at 2:37 PM, Eli Zaretskii <eliz@gnu.org> wrote:
> 
>> From: Yuan Fu <casouri@gmail.com>
>> Date: Thu, 29 Jul 2021 13:55:48 -0400
>> Cc: Stephen Leake <stephen_leake@stephe-leake.org>,
>> cpitclaudel@gmail.com,
>> monnier@iro.umontreal.ca,
>> emacs-devel@gnu.org
>> 
>> Actually, that sounds like how it works in my code right now. After the last few exchanges, I still have the feeling that we are not on the same page. Could you have a look at the code in ts_ensure_parsed and ts_record_change, and see if it aligns with what you consider to be the right thing? If you have read them already and think you understand what are they doing, could you tell me how exactly should these two functions behave, in your opinion? Thanks.
> 
> Where do I find the latest version of the code?

A few messages back I attached a patch, ts.5.patch. Actually I can just attach it again, here.

Yuan


[-- Attachment #2: ts.5.patch --]
[-- Type: application/octet-stream, Size: 23721 bytes --]

From 62fc019a7f57119329d53b9b8a3e8b5c1e61b27f Mon Sep 17 00:00:00 2001
From: Yuan Fu <casouri@gmail.com>
Date: Wed, 28 Jul 2021 21:08:43 -0400
Subject: [PATCH] checkpoint 5

- Move define_error out of json.c
- Add narrowing support
---
 lisp/tree-sitter.el           |  11 +-
 src/eval.c                    |  13 ++
 src/json.c                    |  16 ---
 src/lisp.h                    |   5 +
 src/tree_sitter.c             | 231 +++++++++++++++++++++++-----------
 src/tree_sitter.h             |  15 ++-
 test/src/tree-sitter-tests.el |  53 ++++++++
 7 files changed, 251 insertions(+), 93 deletions(-)

diff --git a/lisp/tree-sitter.el b/lisp/tree-sitter.el
index a6ecb09386..8a887bb406 100644
--- a/lisp/tree-sitter.el
+++ b/lisp/tree-sitter.el
@@ -102,12 +102,13 @@ tree-sitter-font-lock-settings
 
 PATTERN is a tree-sitter query pattern. (See manual for how to
 write query patterns.)  This pattern should capture nodes with
-either face names or function names.  If captured with a face
-name, the node's corresponding text in the buffer is fontified
-with that face; if captured with a function name, the function is
-called with three arguments, BEG END NODE, where BEG and END
+either face symbols or function symbols.  If captured with a face
+symbol, the node's corresponding text in the buffer is fontified
+with that face; if captured with a function symbol, the function
+is called with three arguments, BEG END NODE, where BEG and END
 marks the span of the corresponding text, and NODE is the node
-itself.")
+itself.  If a symbol is both a face and a function, it is treated
+as a face.")
 
 (defun tree-sitter-fontify-region-function (beg end &optional verbose)
   "Fontify the region between BEG and END.
diff --git a/src/eval.c b/src/eval.c
index 18faa0b9b1..33c0763f38 100644
--- a/src/eval.c
+++ b/src/eval.c
@@ -1956,6 +1956,19 @@ signal_error (const char *s, Lisp_Object arg)
   xsignal (Qerror, Fcons (build_string (s), arg));
 }
 
+void
+define_error (Lisp_Object name, const char *message, Lisp_Object parent)
+{
+  eassert (SYMBOLP (name));
+  eassert (SYMBOLP (parent));
+  Lisp_Object parent_conditions = Fget (parent, Qerror_conditions);
+  eassert (CONSP (parent_conditions));
+  eassert (!NILP (Fmemq (parent, parent_conditions)));
+  eassert (NILP (Fmemq (name, parent_conditions)));
+  Fput (name, Qerror_conditions, pure_cons (name, parent_conditions));
+  Fput (name, Qerror_message, build_pure_c_string (message));
+}
+
 /* Use this for arithmetic overflow, e.g., when an integer result is
    too large even for a bignum.  */
 void
diff --git a/src/json.c b/src/json.c
index 3f1d27ad7f..ff28143a3c 100644
--- a/src/json.c
+++ b/src/json.c
@@ -1098,22 +1098,6 @@ DEFUN ("json-parse-buffer", Fjson_parse_buffer, Sjson_parse_buffer,
   return unbind_to (count, lisp);
 }
 
-/* Simplified version of 'define-error' that works with pure
-   objects.  */
-
-static void
-define_error (Lisp_Object name, const char *message, Lisp_Object parent)
-{
-  eassert (SYMBOLP (name));
-  eassert (SYMBOLP (parent));
-  Lisp_Object parent_conditions = Fget (parent, Qerror_conditions);
-  eassert (CONSP (parent_conditions));
-  eassert (!NILP (Fmemq (parent, parent_conditions)));
-  eassert (NILP (Fmemq (name, parent_conditions)));
-  Fput (name, Qerror_conditions, pure_cons (name, parent_conditions));
-  Fput (name, Qerror_message, build_pure_c_string (message));
-}
-
 void
 syms_of_json (void)
 {
diff --git a/src/lisp.h b/src/lisp.h
index e439447283..d30509b61a 100644
--- a/src/lisp.h
+++ b/src/lisp.h
@@ -5127,6 +5127,11 @@ maybe_gc (void)
     maybe_garbage_collect ();
 }
 
+/* Simplified version of 'define-error' that works with pure
+   objects.  */
+void
+define_error (Lisp_Object name, const char *message, Lisp_Object parent);
+
 INLINE_HEADER_END
 
 #endif /* EMACS_LISP_H */
diff --git a/src/tree_sitter.c b/src/tree_sitter.c
index e9f8ddc7e3..5e16df7758 100644
--- a/src/tree_sitter.c
+++ b/src/tree_sitter.c
@@ -19,17 +19,8 @@ Copyright (C) 2021 Free Software Foundation, Inc.
 
 #include <config.h>
 
-#include <sys/types.h>
-#include <sys/stat.h>
-#include <sys/param.h>
-#include <errno.h>
-#include <stdio.h>
-#include <stdlib.h>
-#include <unistd.h>
-
 #include "lisp.h"
 #include "buffer.h"
-#include "coding.h"
 #include "tree_sitter.h"
 
 /* parser.h defines a macro ADVANCE that conflicts with alloc.c.  */
@@ -61,6 +52,16 @@ DEFUN ("tree-sitter-node-p",
 
 /*** Parsing functions */
 
+static inline void
+ts_tree_edit_1 (TSTree *tree, ptrdiff_t start_byte,
+		ptrdiff_t old_end_byte, ptrdiff_t new_end_byte)
+{
+  TSPoint dummy_point = {0, 0};
+  TSInputEdit edit = {start_byte, old_end_byte, new_end_byte,
+		      dummy_point, dummy_point, dummy_point};
+  ts_tree_edit (tree, &edit);
+}
+
 /* Update each parser's tree after the user made an edit.  This
 function does not parse the buffer and only updates the tree. (So it
 should be very fast.)  */
@@ -68,18 +69,38 @@ DEFUN ("tree-sitter-node-p",
 ts_record_change (ptrdiff_t start_byte, ptrdiff_t old_end_byte,
 		  ptrdiff_t new_end_byte)
 {
+  eassert(start_byte <= old_end_byte);
+  eassert(start_byte <= new_end_byte);
+
   Lisp_Object parser_list = Fsymbol_value (Qtree_sitter_parser_list);
-  TSPoint dummy_point = {0, 0};
-  TSInputEdit edit = {start_byte, old_end_byte, new_end_byte,
-		      dummy_point, dummy_point, dummy_point};
+
   while (!NILP (parser_list))
     {
       Lisp_Object lisp_parser = Fcar (parser_list);
       TSTree *tree = XTS_PARSER (lisp_parser)->tree;
       if (tree != NULL)
-	ts_tree_edit (tree, &edit);
-      XTS_PARSER (lisp_parser)->need_reparse = true;
-      parser_list = Fcdr (parser_list);
+	{
+	  /* We "clip" the change to between visible_beg and
+	     visible_end.  It is okay if visible_end ends up larger
+	     than BUF_Z, tree-sitter only access buffer text during
+	     re-parse, and we will adjust visible_beg/end before
+	     re-parse.  */
+	  ptrdiff_t visible_beg = XTS_PARSER (lisp_parser)->visible_beg;
+	  ptrdiff_t visible_end = XTS_PARSER (lisp_parser)->visible_end;
+
+	  ptrdiff_t visible_start =
+	    max (visible_beg, start_byte) - visible_beg;
+	  ptrdiff_t visible_old_end =
+	    min (visible_end, old_end_byte) - visible_beg;
+	  ptrdiff_t visible_new_end =
+	    min (visible_end, new_end_byte) - visible_beg;
+
+	  ts_tree_edit_1 (tree, visible_start, visible_old_end,
+			  visible_new_end);
+	  XTS_PARSER (lisp_parser)->need_reparse = true;
+
+	  parser_list = Fcdr (parser_list);
+	}
     }
 }
 
@@ -93,16 +114,67 @@ ts_ensure_parsed (Lisp_Object parser)
   TSParser *ts_parser = XTS_PARSER (parser)->parser;
   TSTree *tree = XTS_PARSER(parser)->tree;
   TSInput input = XTS_PARSER (parser)->input;
+  struct buffer *buffer = XTS_PARSER (parser)->buffer;
+
+  /* Before we parse, catch up with the narrowing situation.  We
+     change visible_beg and visible_end to match BUF_BEGV_BYTE and
+     BUF_ZV_BYTE, and inform tree-sitter of the change.  */
+  ptrdiff_t visible_beg = XTS_PARSER (parser)->visible_beg;
+  ptrdiff_t visible_end = XTS_PARSER (parser)->visible_end;
+  /* Before re-parse, we want to move the visible range of tree-sitter
+     to matched the narrowed range. For example:
+     Move ________|____|__
+     to   |____|__________ */
+
+  /* 1. Make sure visible_beg <= BUF_BEGV_BYTE.  */
+  if (visible_beg > BUF_BEGV_BYTE (buffer))
+    {
+      /* Tree-sitter sees: insert at the beginning. */
+      ts_tree_edit_1 (tree, 0, 0, visible_beg - BUF_BEGV_BYTE (buffer));
+      visible_beg = BUF_BEGV_BYTE (buffer);
+    }
+  /* 2. Make sure visible_end = BUF_ZV_BYTE.  */
+  if (visible_end < BUF_ZV_BYTE (buffer))
+    {
+      /* Tree-sitter sees: insert at the end.  */
+      ts_tree_edit_1 (tree, visible_end - visible_beg,
+		      visible_end - visible_beg,
+		      BUF_ZV_BYTE (buffer) - visible_beg);
+      visible_end = BUF_ZV_BYTE (buffer);
+    }
+  else if (visible_end > BUF_ZV_BYTE (buffer))
+    {
+      /* Tree-sitter sees: delete at the end.  */
+      ts_tree_edit_1 (tree, BUF_ZV_BYTE (buffer) - visible_beg,
+		      visible_end - visible_beg,
+		      BUF_ZV_BYTE (buffer) - visible_beg);
+      visible_end = BUF_ZV_BYTE (buffer);
+    }
+  /* 3. Make sure visible_beg = BUF_BEGV_BYTE.  */
+  if (visible_beg < BUF_BEGV_BYTE (buffer))
+    {
+      /* Tree-sitter sees: delete at the beginning.  */
+      ts_tree_edit_1 (tree, 0, BUF_BEGV_BYTE (buffer) - visible_beg, 0);
+      visible_beg = BUF_BEGV_BYTE (buffer);
+    }
+  XTS_PARSER (parser)->visible_beg = visible_beg;
+  XTS_PARSER (parser)->visible_end = visible_end;
+
   TSTree *new_tree = ts_parser_parse(ts_parser, tree, input);
-  /* This should be very rare: it only happens when 1) language is not
-     set (impossible in Emacs because the user has to supply a
-     language to create a parser), 2) parse canceled due to timeout
-     (impossible because we don't set a timeout), 3) parse canceled
-     due to cancellation flag (impossible because we don't set the
-     flag).  (See comments for ts_parser_parse in
+  /* This should be very rare (impossible, really): it only happens
+     when 1) language is not set (impossible in Emacs because the user
+     has to supply a language to create a parser), 2) parse canceled
+     due to timeout (impossible because we don't set a timeout), 3)
+     parse canceled due to cancellation flag (impossible because we
+     don't set the flag).  (See comments for ts_parser_parse in
      tree_sitter/api.h.)  */
   if (new_tree == NULL)
-    signal_error ("Parse failed", parser);
+    {
+      Lisp_Object buf;
+      XSETBUFFER(buf, buffer);
+      xsignal1 (Qtree_sitter_parse_error, buf);
+    }
+
   ts_tree_delete (tree);
   XTS_PARSER (parser)->tree = new_tree;
   XTS_PARSER (parser)->need_reparse = false;
@@ -110,13 +182,18 @@ ts_ensure_parsed (Lisp_Object parser)
 }
 
 /* This is the read function provided to tree-sitter to read from a
-   buffer.  It reads one character at a time and automatically skip
+   buffer.  It reads one character at a time and automatically skips
    the gap.  */
 const char*
-ts_read_buffer (void *buffer, uint32_t byte_index,
+ts_read_buffer (void *parser, uint32_t byte_index,
 		TSPoint position, uint32_t *bytes_read)
 {
-  ptrdiff_t byte_pos = byte_index + 1;
+  struct buffer *buffer = ((struct Lisp_TS_Parser *) parser)->buffer;
+  ptrdiff_t visible_beg = ((struct Lisp_TS_Parser *) parser)->visible_beg;
+  ptrdiff_t byte_pos = byte_index + visible_beg;
+  /* We will make sure visible_beg >= BUF_BEG_BYTE before re-parse (in
+     ts_ensure_parsed), so byte_pos will never be smaller than
+     BUF_BEG_BYTE (unless byte_index < 0).  */
 
   /* Read one character.  Tree-sitter wants us to set bytes_read to 0
      if it reads to the end of buffer.  It doesn't say what it wants
@@ -126,26 +203,26 @@ ts_read_buffer (void *buffer, uint32_t byte_index,
   int len;
   /* This function could run from a user command, so it is better to
      do nothing instead of raising an error. (It was a pain in the a**
-     to read mega-if-conditions in Emacs source, so I write the two
-     branches separately, hoping the compiler can merge them.)  */
-  if (!BUFFER_LIVE_P ((struct buffer *) buffer))
+     to decrypt mega-if-conditions in Emacs source, so I wrote the two
+     branches separately.)  */
+  if (!BUFFER_LIVE_P (buffer))
     {
       beg = "";
       len = 0;
     }
-  // TODO BUF_ZV_BYTE?
-  else if (byte_pos >= BUF_Z_BYTE ((struct buffer *) buffer))
+  /* Reached visible end-of-buffer, tell tree-sitter to read no more.  */
+  else if (byte_pos >= BUF_ZV_BYTE (buffer))
     {
       beg = "";
       len = 0;
     }
+  /* Normal case, read a character.  */
   else
     {
       beg = (char *) BUF_BYTE_ADDRESS (buffer, byte_pos);
-      len = BYTES_BY_CHAR_HEAD ((int) beg);
+      len = BYTES_BY_CHAR_HEAD ((int) *beg);
     }
   *bytes_read = (uint32_t) len;
-
   return beg;
 }
 
@@ -158,13 +235,16 @@ make_ts_parser (struct buffer *buffer, TSParser *parser,
 {
   struct Lisp_TS_Parser *lisp_parser
     = ALLOCATE_PSEUDOVECTOR (struct Lisp_TS_Parser, name, PVEC_TS_PARSER);
+
   lisp_parser->name = name;
   lisp_parser->buffer = buffer;
   lisp_parser->parser = parser;
   lisp_parser->tree = tree;
-  TSInput input = {buffer, ts_read_buffer, TSInputEncodingUTF8};
+  TSInput input = {lisp_parser, ts_read_buffer, TSInputEncodingUTF8};
   lisp_parser->input = input;
   lisp_parser->need_reparse = true;
+  lisp_parser->visible_beg = BUF_BEGV (buffer);
+  lisp_parser->visible_end = BUF_ZV (buffer);
   return make_lisp_ptr (lisp_parser, Lisp_Vectorlike);
 }
 
@@ -287,7 +367,7 @@ DEFUN ("tree-sitter-parse-string",
   /* See comment in ts_ensure_parsed for possible reasons for a
      failure.  */
   if (tree == NULL)
-    signal_error ("Failed to parse STRING", string);
+    xsignal1 (Qtree_sitter_parse_error, string);
 
   TSNode root_node = ts_tree_root_node (tree);
 
@@ -535,7 +615,9 @@ DEFUN ("tree-sitter-node-first-child-for-byte",
 {
   CHECK_INTEGER (pos);
 
-  struct buffer *buf = (XTS_PARSER (XTS_NODE (node)->parser)->buffer);
+  struct buffer *buf = XTS_PARSER (XTS_NODE (node)->parser)->buffer;
+  ptrdiff_t visible_beg =
+    XTS_PARSER (XTS_NODE (node)->parser)->visible_beg;
   ptrdiff_t byte_pos = XFIXNUM (pos);
 
   if (byte_pos < BUF_BEGV_BYTE (buf) || byte_pos > BUF_ZV_BYTE (buf))
@@ -544,9 +626,10 @@ DEFUN ("tree-sitter-node-first-child-for-byte",
   TSNode ts_node = XTS_NODE (node)->node;
   TSNode child;
   if (NILP (named))
-    child = ts_node_first_child_for_byte (ts_node, byte_pos - 1);
+    child = ts_node_first_child_for_byte (ts_node, byte_pos - visible_beg);
   else
-    child = ts_node_first_named_child_for_byte (ts_node, byte_pos - 1);
+    child = ts_node_first_named_child_for_byte
+      (ts_node, byte_pos - visible_beg);
 
   if (ts_node_is_null(child))
     return Qnil;
@@ -566,7 +649,9 @@ DEFUN ("tree-sitter-node-descendant-for-byte-range",
   CHECK_INTEGER (beg);
   CHECK_INTEGER (end);
 
-  struct buffer *buf = (XTS_PARSER (XTS_NODE (node)->parser)->buffer);
+  struct buffer *buf = XTS_PARSER (XTS_NODE (node)->parser)->buffer;
+  ptrdiff_t visible_beg =
+    XTS_PARSER (XTS_NODE (node)->parser)->visible_beg;
   ptrdiff_t byte_beg = XFIXNUM (beg);
   ptrdiff_t byte_end = XFIXNUM (end);
 
@@ -580,10 +665,10 @@ DEFUN ("tree-sitter-node-descendant-for-byte-range",
   TSNode child;
   if (NILP (named))
     child = ts_node_descendant_for_byte_range
-      (ts_node, byte_beg - 1 , byte_end - 1);
+      (ts_node, byte_beg - visible_beg , byte_end - visible_beg);
   else
     child = ts_node_named_descendant_for_byte_range
-      (ts_node, byte_beg - 1, byte_end - 1);
+      (ts_node, byte_beg - visible_beg, byte_end - visible_beg);
 
   if (ts_node_is_null(child))
     return Qnil;
@@ -593,31 +678,24 @@ DEFUN ("tree-sitter-node-descendant-for-byte-range",
 
 /* Query functions */
 
-Lisp_Object ts_query_error_to_string (TSQueryError error)
+char*
+ts_query_error_to_string (TSQueryError error)
 {
-  char *error_name;
   switch (error)
     {
     case TSQueryErrorNone:
-      error_name = "none";
-      break;
+      return "none";
     case TSQueryErrorSyntax:
-      error_name = "syntax";
-      break;
+      return "syntax";
     case TSQueryErrorNodeType:
-      error_name = "node type";
-      break;
+      return "node type";
     case TSQueryErrorField:
-      error_name = "field";
-      break;
+      return "field";
     case TSQueryErrorCapture:
-      error_name = "capture";
-      break;
+      return "capture";
     case TSQueryErrorStructure:
-      error_name = "structure";
-      break;
+      return "structure";
     }
-  return  make_pure_c_string (error_name, strlen(error_name));
 }
 
 DEFUN ("tree-sitter-query-capture",
@@ -634,7 +712,7 @@ DEFUN ("tree-sitter-query-capture",
 BEG and END, if _both_ non-nil, specifies the range in which the query
 is executed.
 
-Return nil if the query failed.  */)
+Raise an tree-sitter-query-error if PATTERN is malformed.  */)
   (Lisp_Object node, Lisp_Object pattern,
    Lisp_Object beg, Lisp_Object end)
 {
@@ -643,47 +721,56 @@ DEFUN ("tree-sitter-query-capture",
 
   TSNode ts_node = XTS_NODE (node)->node;
   Lisp_Object lisp_parser = XTS_NODE (node)->parser;
+  ptrdiff_t visible_beg =
+    XTS_PARSER (XTS_NODE (node)->parser)->visible_beg;
   const TSLanguage *lang = ts_parser_language
     (XTS_PARSER (lisp_parser)->parser);
   char *source = SSDATA (pattern);
 
+
   uint32_t error_offset;
-  uint32_t error_type;
+  TSQueryError error_type;
   TSQuery *query = ts_query_new (lang, source, strlen (source),
 				 &error_offset, &error_type);
   TSQueryCursor *cursor = ts_query_cursor_new ();
 
   if (query == NULL)
     {
-      // FIXME: Signal an error?
-      return Qnil;
+      // FIXME: Still crashes, debug when I can get a gdb.
+      xsignal2 (Qtree_sitter_query_error,
+		make_fixnum (error_offset),
+		build_string (ts_query_error_to_string (error_type)));
     }
   if (!NILP (beg) && !NILP (end))
     {
       EMACS_INT beg_byte = XFIXNUM (beg);
       EMACS_INT end_byte = XFIXNUM (end);
       ts_query_cursor_set_byte_range
-	(cursor, (uint32_t) beg_byte - 1, (uint32_t) end_byte - 1);
+	(cursor, (uint32_t) beg_byte - visible_beg,
+	 (uint32_t) end_byte - visible_beg);
     }
 
   ts_query_cursor_exec (cursor, query, ts_node);
   TSQueryMatch match;
-  TSQueryCapture capture;
+
   Lisp_Object result = Qnil;
-  Lisp_Object entry;
-  Lisp_Object captured_node;
-  const char *capture_name;
-  uint32_t capture_name_len;
   while (ts_query_cursor_next_match (cursor, &match))
     {
       const TSQueryCapture *captures = match.captures;
       for (int idx=0; idx < match.capture_count; idx++)
 	{
+	  TSQueryCapture capture;
+	  Lisp_Object captured_node;
+	  const char *capture_name;
+	  Lisp_Object entry;
+	  uint32_t capture_name_len;
+
 	  capture = captures[idx];
 	  captured_node = make_ts_node(lisp_parser, capture.node);
 	  capture_name = ts_query_capture_name_for_id
 	    (query, capture.index, &capture_name_len);
-	  entry = Fcons (intern_c_string (capture_name),
+	  entry = Fcons (intern_c_string_1
+			 (capture_name, capture_name_len),
 			 captured_node);
 	  result = Fcons (entry, result);
 	}
@@ -705,11 +792,15 @@ syms_of_tree_sitter (void)
   DEFSYM (Qhas_changes, "has-changes");
   DEFSYM (Qhas_error, "has-error");
 
+  DEFSYM(Qtree_sitter_error, "tree-sitter-error");
   DEFSYM (Qtree_sitter_query_error, "tree-sitter-query-error");
-  Fput (Qtree_sitter_query_error, Qerror_conditions,
-	pure_list (Qtree_sitter_query_error, Qerror));
-  Fput (Qtree_sitter_query_error, Qerror_message,
-	build_pure_c_string ("Error with query pattern"))
+  DEFSYM (Qtree_sitter_parse_error, "tree-sitter-parse-error")
+  define_error (Qtree_sitter_error, "Generic tree-sitter error", Qerror);
+  define_error (Qtree_sitter_query_error, "Query pattern is malformed",
+		Qtree_sitter_error);
+  define_error (Qtree_sitter_parse_error, "Parse failed",
+		Qtree_sitter_error);
+
 
   DEFSYM (Qtree_sitter_parser_list, "tree-sitter-parser-list");
   DEFVAR_LISP ("tree-sitter-parser-list", Vtree_sitter_parser_list,
diff --git a/src/tree_sitter.h b/src/tree_sitter.h
index e9b4a71326..7e0fec0ee9 100644
--- a/src/tree_sitter.h
+++ b/src/tree_sitter.h
@@ -20,8 +20,6 @@ Copyright (C) 2021 Free Software Foundation, Inc.
 #ifndef EMACS_TREE_SITTER_H
 #define EMACS_TREE_SITTER_H
 
-#include <sys/types.h>
-
 #include "lisp.h"
 
 #include <tree_sitter/api.h>
@@ -33,12 +31,25 @@ #define EMACS_TREE_SITTER_H
 struct Lisp_TS_Parser
 {
   union vectorlike_header header;
+  /* A parser's name is just a convenient tag, see docstring for
+     'tree-sitter-make-parser', and 'tree-sitter-get-parser'. */
   Lisp_Object name;
   struct buffer *buffer;
   TSParser *parser;
   TSTree *tree;
   TSInput input;
+  /* Re-parsing an unchanged buffer is not free for tree-sitter, so we
+     only make it re-parse when need_reparse == true.  That usually
+     means some change is made in the buffer.  But others could set
+     this field to true to force tree-sitter to re-parse.  */
   bool need_reparse;
+  /* This two positions record the byte position of the "visible
+     region" that tree-sitter sees.  Unlike markers, These two
+     positions do not change as the user inserts and deletes text
+     around them. Before re-parse, we move these positions to match
+     BUF_BEGV_BYTE and BUF_ZV_BYTE.  */
+  ptrdiff_t visible_beg;
+  ptrdiff_t visible_end;
 };
 
 /* A wrapper around a tree-sitter node.  */
diff --git a/test/src/tree-sitter-tests.el b/test/src/tree-sitter-tests.el
index c61ad678d2..69104568de 100644
--- a/test/src/tree-sitter-tests.el
+++ b/test/src/tree-sitter-tests.el
@@ -148,5 +148,58 @@ tree-sitter-query-api
                          (cdr entry))))
                 (tree-sitter-query-capture root-node pattern)))))))
 
+(ert-deftest tree-sitter-narrow ()
+  "Tests if narrowing works."
+  (with-temp-buffer
+    (let (parser root-node pattern doc-node object-node pair-node)
+      (progn
+        (insert "xxx[1,{\"name\": \"Bob\"},2,3]xxx")
+        (narrow-to-region (+ (point-min) 3) (- (point-max) 3))
+        (setq parser (tree-sitter-create-parser
+                      (current-buffer) (tree-sitter-json)))
+        (setq root-node (tree-sitter-parser-root-node
+                         parser)))
+      ;; This test is from the basic test.
+      (should
+       (equal
+        (tree-sitter-node-string
+         (tree-sitter-parser-root-node parser))
+        "(document (array (number) (object (pair key: (string (string_content)) value: (string (string_content)))) (number) (number)))"))
+
+      (widen)
+      (goto-char (point-min))
+      (insert "ooo")
+      (should (equal "oooxxx[1,{\"name\": \"Bob\"},2,3]xxx"
+                     (buffer-string)))
+      (delete-region 10 26)
+      (should (equal "oooxxx[1,2,3]xxx"
+                     (buffer-string)))
+      (narrow-to-region (+ (point-min) 6) (- (point-max) 3))
+      ;; This test is also from the basic test.
+      (should
+       (equal (tree-sitter-node-string
+               (tree-sitter-parser-root-node parser))
+              "(document (array (number) (number) (number)))"))
+      (widen)
+      (goto-char (point-max))
+      (insert "[1,2]")
+      (should (equal "oooxxx[1,2,3]xxx[1,2]"
+                     (buffer-string)))
+      (narrow-to-region (- (point-max) 5) (point-max))
+      (should
+       (equal (tree-sitter-node-string
+               (tree-sitter-parser-root-node parser))
+              "(document (array (number) (number)))"))
+      (widen)
+      (goto-char (point-min))
+      (insert "[1]")
+      (should (equal "[1]oooxxx[1,2,3]xxx[1,2]"
+                     (buffer-string)))
+      (narrow-to-region (point-min) (+ (point-min) 3))
+      (should
+       (equal (tree-sitter-node-string
+               (tree-sitter-parser-root-node parser))
+              "(document (array (number)))")))))
+
 (provide 'tree-sitter-tests)
 ;;; tree-sitter-tests.el ends here
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-28 16:41                                                                       ` Eli Zaretskii
@ 2021-07-29 22:58                                                                         ` Stephen Leake
  2021-07-30  6:00                                                                           ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Stephen Leake @ 2021-07-29 22:58 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Yuan Fu, cpitclaudel, monnier, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Yuan Fu <casouri@gmail.com>
>> Date: Wed, 28 Jul 2021 12:36:33 -0400
>> Cc: Eli Zaretskii <eliz@gnu.org>,
>>  emacs-devel <emacs-devel@gnu.org>,
>>  Clément Pit-Claudel <cpitclaudel@gmail.com>,
>>  monnier@iro.umontreal.ca
>> 
>> > So don't send a change that deletes the hidden text; just send changes
>> > in the visible part of the text (that's the only place the user can make
>> > changes). tree-sitter will only run the scanner on the change regions,
>> > so it will only request text from the visible part of the buffer;
>> > all the requests will succeed.
>> 
>> Then we are not hiding the hidden text from tree-sitter. The
>> implementation you described, IIUC, is essentially do nothing
>> special when the buffer is narrowed.
>
> If the TS parser is called while the narrowing is in effect, it will
> be unable to access text beyond BEGV..ZV.  So in that case the
> narrowing _will_ affect TS.

Please read again; TS is affected in principle, but in practice, in the
absence of programming errors, it will never try to access text outside
the narrowing, so it won't notice.

-- 
-- Stephe



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-28 17:54                                                                           ` Eli Zaretskii
  2021-07-28 18:46                                                                             ` Yuan Fu
@ 2021-07-29 23:01                                                                             ` Stephen Leake
  1 sibling, 0 replies; 284+ messages in thread
From: Stephen Leake @ 2021-07-29 23:01 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Yuan Fu, emacs-devel, cpitclaudel, monnier

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Yuan Fu <casouri@gmail.com>
>> Date: Wed, 28 Jul 2021 13:47:42 -0400
>> Cc: Stephen Leake <stephen_leake@stephe-leake.org>,
>>  cpitclaudel@gmail.com,
>>  monnier@iro.umontreal.ca,
>>  emacs-devel@gnu.org
>> 
>> Could you describe the desired effect on tree-sitter when the buffer is narrowed?
>
> The behavior should be the same as if the text before and after the
> narrowed region didn't exist.

That would be true for the multi-major-mode use case, but not for the
temporarily narrow-to-defun case.

In other words, this should be up to the major mode to determine; the
low-level code should support either case (there are probably other use
cases out there).

>> If we just deny accessibility of the hidden region from tree-sitter,
>> tree-sitter is still aware of the hidden text, because it has
>> previously parsed the hidden text and stored the result in the parse
>> tree.
>
> The adherence to narrowing is for the use cases where TS is _always_
> invoked on the same narrowed region.  

right; the multi-major-mode case.

> You seem to be thinking about changes in the narrowing while TS is
> parsing, or between consecutive re-parsing calls, but I see no
> interesting/important use cases which would need to do that. And if
> there are some tricky cases which do need this, the respective Lisp
> programs will have to deal with the problem.

Right.

-- 
-- Stephe



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-28 18:46                                                                             ` Yuan Fu
  2021-07-28 19:00                                                                               ` Eli Zaretskii
@ 2021-07-29 23:06                                                                               ` Stephen Leake
  2021-07-30  0:35                                                                               ` Richard Stallman
  2 siblings, 0 replies; 284+ messages in thread
From: Stephen Leake @ 2021-07-29 23:06 UTC (permalink / raw)
  To: Yuan Fu; +Cc: Eli Zaretskii, emacs-devel, cpitclaudel, monnier

Yuan Fu <casouri@gmail.com> writes:

>> On Jul 28, 2021, at 1:54 PM, Eli Zaretskii <eliz@gnu.org> wrote:
>>
>>> From: Yuan Fu <casouri@gmail.com>

>> The adherence to narrowing is for the use cases where TS is _always_
>> invoked on the same narrowed region.  You seem to be thinking about
>> changes in the narrowing while TS is parsing, or between consecutive
>> re-parsing calls, but I see no interesting/important use cases which
>> would need to do that.  And if there are some tricky cases which do
>> need this, the respective Lisp programs will have to deal with the
>> problem.
>
> That makes sense. However it bring up a problem. Consider such a
> buffer: XXAAXX.

There is always a delimiter in the text that defines the boundary
between XX and AA; say "{{" for example, with "}}" at the other end of
AA.

> Say lisp narrows to AA and creates a tree-sitter parser. Then lisp
> widens the buffer, and user inserts B in front of AA. Now the buffer
> is XXBAAXX.

before or after the delimiter?

XX {{ BAA }} XX :   B is a change to AA

XXB {{ AA }} XX :   B is a change to XX


> Emacs has two options to convey this change to the tree-sitter parser:
> 1) it does not, then tree-sitter still thinks the buffer is AA,
> essentially the portion where tree-sitter sees is pushed forward by
> one character, 2) it tells tree-sitter the user inserted a character
> at the beginning, then tree-sitter thinks the buffer is BAA. Which
> option is correct depends on how does lisp later narrows: if lisp
> narrows to AA, then option 1 is correct, if lisp narrows to BAA, then
> option 2 is correct. But how do we know which option is correct before
> lisp narrows?

The major mode determines the boundaries and the narrowing, so leave it
up to that code to be consistent, not your code.

--
-- Stephe



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-28 18:46                                                                             ` Yuan Fu
  2021-07-28 19:00                                                                               ` Eli Zaretskii
  2021-07-29 23:06                                                                               ` How to add pseudo vector types Stephen Leake
@ 2021-07-30  0:35                                                                               ` Richard Stallman
  2021-07-30  0:46                                                                                 ` Alexandre Garreau
  2021-07-30  6:35                                                                                 ` Eli Zaretskii
  2 siblings, 2 replies; 284+ messages in thread
From: Richard Stallman @ 2021-07-30  0:35 UTC (permalink / raw)
  To: Yuan Fu; +Cc: eliz, emacs-devel, stephen_leake, cpitclaudel, monnier

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > s two options to convey this change to the tree-sitter parser: 1)
  > it does not, then tree-sitter still thinks the buffer is AA,
  > essentially the portion where tree-sitter sees is pushed forward
  > by one character, 2) it tells tree-sitter the user inserted a
  > character at the beginning, then tree-sitter thinks the buffer is
  > BAA.

  > Which option is correct depends on how does lisp later narrows: if
  > lisp narrows to AA, then option 1 is correct, if lisp narrows to
  > BAA, then option 2 is correct. But how do we know which option is
  > correct before lisp narrows?

I suggest we create a way for the program to declare the purpose for
each instance of narrowing.

I know of two kinds of purposes for using narrowing.

1. To focus operations on syntactic entity in a buffer containing
other things which are essentially unrelated.  Let's call this "semantic" narrowing.

For instance, when Rmail narrows the file buffer to just one message,
that is semantic narrowing.  Whatever is outside the buffer bounds is
unrelated to parsing the current message.

2. To show just part of the text you're looking at.  This is a display
feature, usually temporary, and would be enabled or disabled by the
user.  Let's call it "display" narrowing.

I don't think Emacs can tell heuristically which kind of narrowing a
program is doing.

I propose we create a way for Lisp programs to declare when they do
semantic narrowing.  They could specify markers for the beginning and
end of that narrowing.

Facilities for parsing the buffer should heed semantic narrowing but
disregard display narrowing.

Various kinds of semantic narrowing should be able to nest, and
display narrowing should be able to nest inside semantic narrowings.

Comments or critiques?

-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)





^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-30  0:35                                                                               ` Richard Stallman
@ 2021-07-30  0:46                                                                                 ` Alexandre Garreau
  2021-07-30  6:35                                                                                 ` Eli Zaretskii
  1 sibling, 0 replies; 284+ messages in thread
From: Alexandre Garreau @ 2021-07-30  0:46 UTC (permalink / raw)
  To: emacs-devel, rms; +Cc: Yuan Fu, stephen_leake, eliz, cpitclaudel, monnier

Le vendredi 30 juillet 2021, 02:35:33 CEST Richard Stallman a écrit :
> I propose we create a way for Lisp programs to declare when they do
> semantic narrowing.  They could specify markers for the beginning and
> end of that narrowing.
> 
> Facilities for parsing the buffer should heed semantic narrowing but
> disregard display narrowing.
> 
> Various kinds of semantic narrowing should be able to nest, and
> display narrowing should be able to nest inside semantic narrowings.
> 
> Comments or critiques?

Nesting? that’s very interesting, I always felt that emacs’ separation of 
data in “atomic” buffers, unnested, was limiting…  Couldn’t a such facility 
come with some semantics that could ease the working with multi-modes and 
multiple-formats files, such as php files (including html), html pages 
(including javascript and css), org-mode and its source blocks (currently 
opening another buffer to work), makefiles including a lot of shell-script 
programs, bison/yacc files including C, etc.?

PS: that makes me think of some other reaaaally handy feature that would 
be so convenient: the ability to *include* the content of a buffer inside 
some other buffer, so both’s data are connected, and you can see many small 
files’ content at once while working on some multi-semantics file… but maybe 
it’s a stupid/useless idea (it could be synchronized maybe? or be overly 
difficult, dunno u.u)



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-29 22:58                                                                         ` Stephen Leake
@ 2021-07-30  6:00                                                                           ` Eli Zaretskii
  0 siblings, 0 replies; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-30  6:00 UTC (permalink / raw)
  To: Stephen Leake; +Cc: casouri, cpitclaudel, monnier, emacs-devel

> From: Stephen Leake <stephen_leake@stephe-leake.org>
> Cc: Yuan Fu <casouri@gmail.com>,  emacs-devel@gnu.org,
>   cpitclaudel@gmail.com,  monnier@iro.umontreal.ca
> Date: Thu, 29 Jul 2021 15:58:39 -0700
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> >> From: Yuan Fu <casouri@gmail.com>
> >> Date: Wed, 28 Jul 2021 12:36:33 -0400
> >> Cc: Eli Zaretskii <eliz@gnu.org>,
> >>  emacs-devel <emacs-devel@gnu.org>,
> >>  Clément Pit-Claudel <cpitclaudel@gmail.com>,
> >>  monnier@iro.umontreal.ca
> >> 
> >> > So don't send a change that deletes the hidden text; just send changes
> >> > in the visible part of the text (that's the only place the user can make
> >> > changes). tree-sitter will only run the scanner on the change regions,
> >> > so it will only request text from the visible part of the buffer;
> >> > all the requests will succeed.
> >> 
> >> Then we are not hiding the hidden text from tree-sitter. The
> >> implementation you described, IIUC, is essentially do nothing
> >> special when the buffer is narrowed.
> >
> > If the TS parser is called while the narrowing is in effect, it will
> > be unable to access text beyond BEGV..ZV.  So in that case the
> > narrowing _will_ affect TS.
> 
> Please read again; TS is affected in principle, but in practice, in the
> absence of programming errors, it will never try to access text outside
> the narrowing, so it won't notice.

Sorry, I don't understand what you wanted me to re-read.  As the
subsequent discussions revealed, Yuan had in mind a scenario where the
text outside of the restriction was changed.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-30  0:35                                                                               ` Richard Stallman
  2021-07-30  0:46                                                                                 ` Alexandre Garreau
@ 2021-07-30  6:35                                                                                 ` Eli Zaretskii
  1 sibling, 0 replies; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-30  6:35 UTC (permalink / raw)
  To: rms; +Cc: casouri, emacs-devel, stephen_leake, cpitclaudel, monnier

> From: Richard Stallman <rms@gnu.org>
> Cc: eliz@gnu.org, cpitclaudel@gmail.com,
> 	stephen_leake@stephe-leake.org, monnier@iro.umontreal.ca,
> 	emacs-devel@gnu.org
> Date: Thu, 29 Jul 2021 20:35:33 -0400
> 
> I suggest we create a way for the program to declare the purpose for
> each instance of narrowing.
> 
> I know of two kinds of purposes for using narrowing.
> 
> 1. To focus operations on syntactic entity in a buffer containing
> other things which are essentially unrelated.  Let's call this "semantic" narrowing.
> 
> For instance, when Rmail narrows the file buffer to just one message,
> that is semantic narrowing.  Whatever is outside the buffer bounds is
> unrelated to parsing the current message.
> 
> 2. To show just part of the text you're looking at.  This is a display
> feature, usually temporary, and would be enabled or disabled by the
> user.  Let's call it "display" narrowing.

So another way of discerning between the two is to distinguish the
"Lisp narrowing" from the "user narrowing".

> I don't think Emacs can tell heuristically which kind of narrowing a
> program is doing.

If we agree that the second kind is only done by the user, then no
heuristic is needed.

But I agree that having this recorded explicitly would be a good idea.
We could provide something similar to prog-indentation-context for
this purpose.

> I propose we create a way for Lisp programs to declare when they do
> semantic narrowing.  They could specify markers for the beginning and
> end of that narrowing.
> 
> Facilities for parsing the buffer should heed semantic narrowing but
> disregard display narrowing.
> 
> Various kinds of semantic narrowing should be able to nest, and
> display narrowing should be able to nest inside semantic narrowings.
> 
> Comments or critiques?

We had a long discussion of a similar proposal, see

  https://lists.gnu.org/archive/html/emacs-devel/2017-02/msg00765.html

At the time, we were unable to come to an agreed-upon design, so this
feature was never implemented in mainline Emacs.  Maybe we should
revisit it now.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-29 18:57                                                                                                 ` Yuan Fu
@ 2021-07-30  6:47                                                                                                   ` Eli Zaretskii
  2021-07-30 14:17                                                                                                     ` Yuan Fu
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-07-30  6:47 UTC (permalink / raw)
  To: Yuan Fu; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Thu, 29 Jul 2021 14:57:19 -0400
> Cc: Stephen Leake <stephen_leake@stephe-leake.org>,
>  Clément Pit-Claudel <cpitclaudel@gmail.com>,
>  Stefan Monnier <monnier@iro.umontreal.ca>,
>  emacs-devel@gnu.org
> 
> >> Actually, that sounds like how it works in my code right now. After the last few exchanges, I still have the feeling that we are not on the same page. Could you have a look at the code in ts_ensure_parsed and ts_record_change, and see if it aligns with what you consider to be the right thing? If you have read them already and think you understand what are they doing, could you tell me how exactly should these two functions behave, in your opinion? Thanks.
> > 
> > Where do I find the latest version of the code?
> 
> A few messages back I attached a patch, ts.5.patch. Actually I can just attach it again, here.

That's not the whole code, that's a patch against some previous
version of the code.  So I cannot answer your questions with 100%
certainty, until I see the entire code of the TS support.  For
example, I'm not sure I have a clear idea when are the two functions
ts_ensure_parsed and ts_record_change called.

That said, it looks like the code is correct: you should record the
changes in the entire buffer, but only pass to TS the changes inside
the restriction BEGV..ZV that is in effect at the time of the re-parse
call.  Btw, I don't see the code that filters changes reported to TS
by their positions against the restriction; did I miss something?

And one more question: I understand that ts_read_buffer doesn't check
against BUF_BEGV_BYTE because TS never reads before the "visible beg"
position, is that right?  But if so, why do we need the similar test
for BUF_ZV_BYTE? could TS attempt to read beyond the "visible end"?

Thanks.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-30  6:47                                                                                                   ` Eli Zaretskii
@ 2021-07-30 14:17                                                                                                     ` Yuan Fu
  2021-08-03 10:24                                                                                                       ` Fu Yuan
  2021-08-03 11:47                                                                                                       ` Eli Zaretskii
  0 siblings, 2 replies; 284+ messages in thread
From: Yuan Fu @ 2021-07-30 14:17 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: Clément Pit-Claudel, Stephen Leake, Stefan Monnier, emacs-devel

> 
> That's not the whole code, that's a patch against some previous
> version of the code.  So I cannot answer your questions with 100%
> certainty, until I see the entire code of the TS support.  For
> example, I'm not sure I have a clear idea when are the two functions
> ts_ensure_parsed and ts_record_change called.

Oops, I thought you have all prior patches. You can clone the “ts” branch from 

https://github.com/casouri/emacs.git

If this is ok, I’ll push to this branch instead of sending patches from now on.

> 
> That said, it looks like the code is correct: you should record the
> changes in the entire buffer, but only pass to TS the changes inside
> the restriction BEGV..ZV that is in effect at the time of the re-parse
> call.  Btw, I don't see the code that filters changes reported to TS
> by their positions against the restriction; did I miss something?

Yes, I do clip the change to the visible portion:

ts_record_change (ptrdiff_t start_byte, ptrdiff_t old_end_byte,
		  ptrdiff_t new_end_byte)
{
  eassert(start_byte <= old_end_byte);
  eassert(start_byte <= new_end_byte);

  Lisp_Object parser_list = Fsymbol_value (Qtree_sitter_parser_list);

  while (!NILP (parser_list))
    {
      Lisp_Object lisp_parser = Fcar (parser_list);
      TSTree *tree = XTS_PARSER (lisp_parser)->tree;
      if (tree != NULL)
	{
	  /* We "clip" the change to between visible_beg and
	     visible_end.  It is okay if visible_end ends up larger
	     than BUF_Z, tree-sitter only access buffer text during
	     re-parse, and we will adjust visible_beg/end before
	     re-parse.  */
	  ptrdiff_t visible_beg = XTS_PARSER (lisp_parser)->visible_beg;
	  ptrdiff_t visible_end = XTS_PARSER (lisp_parser)->visible_end;

	  ptrdiff_t visible_start =
	    max (visible_beg, start_byte) - visible_beg;
	  ptrdiff_t visible_old_end =
	    min (visible_end, old_end_byte) - visible_beg;
	  ptrdiff_t visible_new_end =
	    min (visible_end, new_end_byte) - visible_beg;

	  ts_tree_edit_1 (tree, visible_start, visible_old_end,
			  visible_new_end);
	  XTS_PARSER (lisp_parser)->need_reparse = true;

	  parser_list = Fcdr (parser_list);
	}
    }
}

> And one more question: I understand that ts_read_buffer doesn't check
> against BUF_BEGV_BYTE because TS never reads before the "visible beg"
> position, is that right?  

Yes, we always update visible_beg and visible_end to match BUF_BEGV_BYTE and BUF_ZV_BYTE before we instruct tree-sitter to re-parse. So when tree-sitter reads at byte position 0, it translates to buffer byte position 0 + visible_beg = BUF_BEGV_BYTE. 

> But if so, why do we need the similar test
> for BUF_ZV_BYTE? could TS attempt to read beyond the "visible end”?

Tree-sitter doesn’t know the size of the buffer, it just keeps reading until the read function sets bytes_read to 0, signaling that it has reached the end. 

Yuan





^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-30 14:17                                                                                                     ` Yuan Fu
@ 2021-08-03 10:24                                                                                                       ` Fu Yuan
  2021-08-03 11:42                                                                                                         ` Eli Zaretskii
  2021-08-03 11:47                                                                                                       ` Eli Zaretskii
  1 sibling, 1 reply; 284+ messages in thread
From: Fu Yuan @ 2021-08-03 10:24 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: Clément Pit-Claudel, Stephen Leake, Stefan Monnier, emacs-devel

I’m about to change all lisp-facing functions from using byte position to using point. Point is much easier to work with. If lisp wants byte positions, they can just convert from point themselves. Any objections?

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-08-03 10:24                                                                                                       ` Fu Yuan
@ 2021-08-03 11:42                                                                                                         ` Eli Zaretskii
  2021-08-03 11:53                                                                                                           ` Fu Yuan
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-08-03 11:42 UTC (permalink / raw)
  To: Fu Yuan; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel

> From: Fu Yuan <casouri@gmail.com>
> Date: Tue, 3 Aug 2021 06:24:34 -0400
> Cc: Stephen Leake <stephen_leake@stephe-leake.org>,
>  Clément Pit-Claudel <cpitclaudel@gmail.com>,
>  Stefan Monnier <monnier@iro.umontreal.ca>, emacs-devel@gnu.org
> 
> I’m about to change all lisp-facing functions from using byte position to using point.

I don't understand how can you do this.  Point is set by Lisp, and
generally cannot be changed from C, except for very short durations of
time (or if the C code is the implementation of a Lisp command that
just moved point).  If you need to access some buffer position, you
cannot in general use point, because you cannot control where point
is.

> Point is much easier to work with.

In what way is it easier?  I feel that I'm missing something here.

> If lisp wants byte positions, they can just convert from point themselves.

??? What do you mean by that?  Can you show an example?



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-07-30 14:17                                                                                                     ` Yuan Fu
  2021-08-03 10:24                                                                                                       ` Fu Yuan
@ 2021-08-03 11:47                                                                                                       ` Eli Zaretskii
  2021-08-03 12:00                                                                                                         ` Fu Yuan
  1 sibling, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-08-03 11:47 UTC (permalink / raw)
  To: Yuan Fu; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Fri, 30 Jul 2021 10:17:22 -0400
> Cc: Stephen Leake <stephen_leake@stephe-leake.org>,
>  Clément Pit-Claudel <cpitclaudel@gmail.com>,
>  Stefan Monnier <monnier@iro.umontreal.ca>,
>  emacs-devel@gnu.org
> 
> > That said, it looks like the code is correct: you should record the
> > changes in the entire buffer, but only pass to TS the changes inside
> > the restriction BEGV..ZV that is in effect at the time of the re-parse
> > call.  Btw, I don't see the code that filters changes reported to TS
> > by their positions against the restriction; did I miss something?
> 
> Yes, I do clip the change to the visible portion:
> 
> ts_record_change (ptrdiff_t start_byte, ptrdiff_t old_end_byte,
> 		  ptrdiff_t new_end_byte)
> {
>   eassert(start_byte <= old_end_byte);
>   eassert(start_byte <= new_end_byte);
> 
>   Lisp_Object parser_list = Fsymbol_value (Qtree_sitter_parser_list);
> 
>   while (!NILP (parser_list))
>     {
>       Lisp_Object lisp_parser = Fcar (parser_list);
>       TSTree *tree = XTS_PARSER (lisp_parser)->tree;
>       if (tree != NULL)
> 	{
> 	  /* We "clip" the change to between visible_beg and
> 	     visible_end.  It is okay if visible_end ends up larger
> 	     than BUF_Z, tree-sitter only access buffer text during
> 	     re-parse, and we will adjust visible_beg/end before
> 	     re-parse.  */
> 	  ptrdiff_t visible_beg = XTS_PARSER (lisp_parser)->visible_beg;
> 	  ptrdiff_t visible_end = XTS_PARSER (lisp_parser)->visible_end;
> 
> 	  ptrdiff_t visible_start =
> 	    max (visible_beg, start_byte) - visible_beg;
> 	  ptrdiff_t visible_old_end =
> 	    min (visible_end, old_end_byte) - visible_beg;
> 	  ptrdiff_t visible_new_end =
> 	    min (visible_end, new_end_byte) - visible_beg;
> 
> 	  ts_tree_edit_1 (tree, visible_start, visible_old_end,
> 			  visible_new_end);
> 	  XTS_PARSER (lisp_parser)->need_reparse = true;
> 
> 	  parser_list = Fcdr (parser_list);

Hmm... so a change that begins before the restriction and ends inside
the restriction will be sent as if it began at BEGV?  And the rest of
the change will be discarded?  Shouldn't you split such changes in
tow, send to TS the part inside the restriction, and store the rest
for the future, when/if the buffer is widened?



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-08-03 11:42                                                                                                         ` Eli Zaretskii
@ 2021-08-03 11:53                                                                                                           ` Fu Yuan
  2021-08-03 12:21                                                                                                             ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Fu Yuan @ 2021-08-03 11:53 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel


> 在 2021年8月3日,上午7:42,Eli Zaretskii <eliz@gnu.org> 写道:
> 
> 
>> 
>> From: Fu Yuan <casouri@gmail.com>
>> Date: Tue, 3 Aug 2021 06:24:34 -0400
>> Cc: Stephen Leake <stephen_leake@stephe-leake.org>,
>> Clément Pit-Claudel <cpitclaudel@gmail.com>,
>> Stefan Monnier <monnier@iro.umontreal.ca>, emacs-devel@gnu.org
>> 
>> I’m about to change all lisp-facing functions from using byte position to using point.
> 
> I don't understand how can you do this.  Point is set by Lisp, and
> generally cannot be changed from C, except for very short durations of
> time (or if the C code is the implementation of a Lisp command that
> just moved point).  If you need to access some buffer position, you
> cannot in general use point, because you cannot control where point
> is.
> 
>> Point is much easier to work with.
> 
> In what way is it easier?  I feel that I'm missing something here.
> 
>> If lisp wants byte positions, they can just convert from point themselves.
> 
> ??? What do you mean by that?  Can you show an example?

Oh no, I don’t mean that. I meant that, for example, functions like node_start_byte, which returns the byte position of the beginning of the node, will now be node_start_pos, which returns a point position. And if I want the byte position of the beginning of the node, I can use (position-to-byte (tree-sitter-node-start-pos node))

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-08-03 11:47                                                                                                       ` Eli Zaretskii
@ 2021-08-03 12:00                                                                                                         ` Fu Yuan
  2021-08-03 12:24                                                                                                           ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Fu Yuan @ 2021-08-03 12:00 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel


> 在 2021年8月3日,上午7:48,Eli Zaretskii <eliz@gnu.org> 写道:
> 
> 
>> 
>> From: Yuan Fu <casouri@gmail.com>
>> Date: Fri, 30 Jul 2021 10:17:22 -0400
>> Cc: Stephen Leake <stephen_leake@stephe-leake.org>,
>> Clément Pit-Claudel <cpitclaudel@gmail.com>,
>> Stefan Monnier <monnier@iro.umontreal.ca>,
>> emacs-devel@gnu.org
>> 
>>> That said, it looks like the code is correct: you should record the
>>> changes in the entire buffer, but only pass to TS the changes inside
>>> the restriction BEGV..ZV that is in effect at the time of the re-parse
>>> call.  Btw, I don't see the code that filters changes reported to TS
>>> by their positions against the restriction; did I miss something?
>> 
>> Yes, I do clip the change to the visible portion:
>> 
>> ts_record_change (ptrdiff_t start_byte, ptrdiff_t old_end_byte,
>>          ptrdiff_t new_end_byte)
>> {
>>  eassert(start_byte <= old_end_byte);
>>  eassert(start_byte <= new_end_byte);
>> 
>>  Lisp_Object parser_list = Fsymbol_value (Qtree_sitter_parser_list);
>> 
>>  while (!NILP (parser_list))
>>    {
>>      Lisp_Object lisp_parser = Fcar (parser_list);
>>      TSTree *tree = XTS_PARSER (lisp_parser)->tree;
>>      if (tree != NULL)
>>    {
>>      /* We "clip" the change to between visible_beg and
>>         visible_end.  It is okay if visible_end ends up larger
>>         than BUF_Z, tree-sitter only access buffer text during
>>         re-parse, and we will adjust visible_beg/end before
>>         re-parse.  */
>>      ptrdiff_t visible_beg = XTS_PARSER (lisp_parser)->visible_beg;
>>      ptrdiff_t visible_end = XTS_PARSER (lisp_parser)->visible_end;
>> 
>>      ptrdiff_t visible_start =
>>        max (visible_beg, start_byte) - visible_beg;
>>      ptrdiff_t visible_old_end =
>>        min (visible_end, old_end_byte) - visible_beg;
>>      ptrdiff_t visible_new_end =
>>        min (visible_end, new_end_byte) - visible_beg;
>> 
>>      ts_tree_edit_1 (tree, visible_start, visible_old_end,
>>              visible_new_end);
>>      XTS_PARSER (lisp_parser)->need_reparse = true;
>> 
>>      parser_list = Fcdr (parser_list);
> 
> Hmm... so a change that begins before the restriction and ends inside
> the restriction will be sent as if it began at BEGV?  And the rest of
> the change will be discarded?  Shouldn't you split such changes in
> tow, send to TS the part inside the restriction, and store the rest
> for the future, when/if the buffer is widened?

Tree-sitter doesn’t care about the content in a change, it will re-scan the buffer content when it re-parses. We only need to inform it the range of the change, so it knows where to re-scan when it re-parses. When the buffer is widened, we will tell tree-sitter that range [BUF_BEG, BUF_BEGV] has changed, and it will re-scan that part when re-parsing. So the part outside the narrowed region will be parsed correctly.

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-08-03 11:53                                                                                                           ` Fu Yuan
@ 2021-08-03 12:21                                                                                                             ` Eli Zaretskii
  2021-08-03 12:50                                                                                                               ` Fu Yuan
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-08-03 12:21 UTC (permalink / raw)
  To: Fu Yuan; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel

> From: Fu Yuan <casouri@gmail.com>
> Date: Tue, 3 Aug 2021 07:53:54 -0400
> Cc: stephen_leake@stephe-leake.org, cpitclaudel@gmail.com,
>  monnier@iro.umontreal.ca, emacs-devel@gnu.org
> 
> Oh no, I don’t mean that. I meant that, for example, functions like node_start_byte, which returns the byte position of the beginning of the node, will now be node_start_pos, which returns a point position.

That's called "character position".  Let's use the accepted
terminology, to minimize misunderstandings.

So in what sense are character positions easier to use than byte
positions?

> And if I want the byte position of the beginning of the node, I can use (position-to-byte (tree-sitter-node-start-pos node))

Caveat: position-to-byte can be expensive.  So in time-critical code,
such as the display engine, we keep both character position and byte
position, and update them in sync.  Then you can use whichever is
easier in each case.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-08-03 12:00                                                                                                         ` Fu Yuan
@ 2021-08-03 12:24                                                                                                           ` Eli Zaretskii
  2021-08-03 13:00                                                                                                             ` Fu Yuan
  2021-08-03 13:28                                                                                                             ` Stefan Monnier
  0 siblings, 2 replies; 284+ messages in thread
From: Eli Zaretskii @ 2021-08-03 12:24 UTC (permalink / raw)
  To: Fu Yuan; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel

> From: Fu Yuan <casouri@gmail.com>
> Date: Tue, 3 Aug 2021 08:00:46 -0400
> Cc: stephen_leake@stephe-leake.org, cpitclaudel@gmail.com,
>  monnier@iro.umontreal.ca, emacs-devel@gnu.org
> 
> > Hmm... so a change that begins before the restriction and ends inside
> > the restriction will be sent as if it began at BEGV?  And the rest of
> > the change will be discarded?  Shouldn't you split such changes in
> > tow, send to TS the part inside the restriction, and store the rest
> > for the future, when/if the buffer is widened?
> 
> Tree-sitter doesn’t care about the content in a change, it will re-scan the buffer content when it re-parses. We only need to inform it the range of the change, so it knows where to re-scan when it re-parses. When the buffer is widened, we will tell tree-sitter that range [BUF_BEG, BUF_BEGV] has changed, and it will re-scan that part when re-parsing.

But that's sub-optimal, no?  Imagine a very large buffer which was
narrowed to a small portion near EOB, then a modification made very
close to EOB but partially before BEGV, then the buffer widened.  With
your method, TS will now have to re-parse almost the entire buffer,
whereas we know it needs to re-parse a very small portion of it.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-08-03 12:21                                                                                                             ` Eli Zaretskii
@ 2021-08-03 12:50                                                                                                               ` Fu Yuan
  2021-08-03 13:03                                                                                                                 ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Fu Yuan @ 2021-08-03 12:50 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel


> 在 2021年8月3日,上午8:22,Eli Zaretskii <eliz@gnu.org> 写道:
> 
> 
>> 
>> From: Fu Yuan <casouri@gmail.com>
>> Date: Tue, 3 Aug 2021 07:53:54 -0400
>> Cc: stephen_leake@stephe-leake.org, cpitclaudel@gmail.com,
>> monnier@iro.umontreal.ca, emacs-devel@gnu.org
>> 
>> Oh no, I don’t mean that. I meant that, for example, functions like node_start_byte, which returns the byte position of the beginning of the node, will now be node_start_pos, which returns a point position.
> 
> That's called "character position".  Let's use the accepted
> terminology, to minimize misunderstandings.

Ah, got it.

> So in what sense are character positions easier to use than byte
> positions?

Here are what you can do with positions:

- find the smallest node that encloses a range (BEG . END)
- get the beginning and end of a node

Since all other functions use character position (eg, put-text-property, point), using character positions saves lisp code some ‘position-to-bytes’. 

>> And if I want the byte position of the beginning of the node, I can use (position-to-byte (tree-sitter-node-start-pos node))
> 
> Caveat: position-to-byte can be expensive.  So in time-critical code,
> such as the display engine, we keep both character position and byte
> position, and update them in sync.  Then you can use whichever is
> easier in each case.

Internally, tree_sitter.c will continue to use byte positions, of course.

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-08-03 12:24                                                                                                           ` Eli Zaretskii
@ 2021-08-03 13:00                                                                                                             ` Fu Yuan
  2021-08-03 13:28                                                                                                             ` Stefan Monnier
  1 sibling, 0 replies; 284+ messages in thread
From: Fu Yuan @ 2021-08-03 13:00 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel


> 在 2021年8月3日,上午8:25,Eli Zaretskii <eliz@gnu.org> 写道:
> 
> 
>> 
>> From: Fu Yuan <casouri@gmail.com>
>> Date: Tue, 3 Aug 2021 08:00:46 -0400
>> Cc: stephen_leake@stephe-leake.org, cpitclaudel@gmail.com,
>> monnier@iro.umontreal.ca, emacs-devel@gnu.org
>> 
>>> Hmm... so a change that begins before the restriction and ends inside
>>> the restriction will be sent as if it began at BEGV?  And the rest of
>>> the change will be discarded?  Shouldn't you split such changes in
>>> tow, send to TS the part inside the restriction, and store the rest
>>> for the future, when/if the buffer is widened?
>> 
>> Tree-sitter doesn’t care about the content in a change, it will re-scan the buffer content when it re-parses. We only need to inform it the range of the change, so it knows where to re-scan when it re-parses. When the buffer is widened, we will tell tree-sitter that range [BUF_BEG, BUF_BEGV] has changed, and it will re-scan that part when re-parsing.
> 
> But that's sub-optimal, no?  Imagine a very large buffer which was
> narrowed to a small portion near EOB, then a modification made very
> close to EOB but partially before BEGV, then the buffer widened.  With
> your method, TS will now have to re-parse almost the entire buffer,
> whereas we know it needs to re-parse a very small portion of it.

It is indeed, but that’s unavoidable by the way we hide the hidden part of the buffer from tree-sitter. We pretend BUF_BEGV is the beginning of the buffer and nothing exists before it. Then when we widen, we need to “insert” the content between BUF_BEG and BUF_BEGV. I.e., as far as tree-sitter can tell, we inserted that text.

If you want to hide something then re-show it to tree-sitter, and want tree-sitter to know how to re-parse minimally, you should use tree-sitter-parser-set-included-ranges (ts_parser_set_included_ranges). I’ve wrote the lisp binding for it but haven’t pushed the change.

The reason why I didn’t implement narrow with set-ranges was explained earlier.

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-08-03 12:50                                                                                                               ` Fu Yuan
@ 2021-08-03 13:03                                                                                                                 ` Eli Zaretskii
  2021-08-03 13:08                                                                                                                   ` Fu Yuan
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-08-03 13:03 UTC (permalink / raw)
  To: Fu Yuan; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel

> From: Fu Yuan <casouri@gmail.com>
> Date: Tue, 3 Aug 2021 08:50:45 -0400
> Cc: stephen_leake@stephe-leake.org, cpitclaudel@gmail.com,
>  monnier@iro.umontreal.ca, emacs-devel@gnu.org
> 
> > So in what sense are character positions easier to use than byte
> > positions?
> 
> Here are what you can do with positions:
> 
> - find the smallest node that encloses a range (BEG . END)
> - get the beginning and end of a node
> 
> Since all other functions use character position (eg, put-text-property, point), using character positions saves lisp code some ‘position-to-bytes’. 

If you are talking about Lisp, then yes, character positions are a
much better interface.  But on the C level, sometimes you need byte
positions, sometimes character positions, and sometimes both.  Since
you didn't say what level was this about, I cannot say something more
intelligent.

> Internally, tree_sitter.c will continue to use byte positions, of course.

"Internally", as opposed to what?  And what is "internal" in this
context?  I thought we were talking only about the internals.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-08-03 13:03                                                                                                                 ` Eli Zaretskii
@ 2021-08-03 13:08                                                                                                                   ` Fu Yuan
  0 siblings, 0 replies; 284+ messages in thread
From: Fu Yuan @ 2021-08-03 13:08 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel


> 在 2021年8月3日,上午9:03,Eli Zaretskii <eliz@gnu.org> 写道:
> 
> 
>> 
>> From: Fu Yuan <casouri@gmail.com>
>> Date: Tue, 3 Aug 2021 08:50:45 -0400
>> Cc: stephen_leake@stephe-leake.org, cpitclaudel@gmail.com,
>> monnier@iro.umontreal.ca, emacs-devel@gnu.org
>> 
>>> So in what sense are character positions easier to use than byte
>>> positions?
>> 
>> Here are what you can do with positions:
>> 
>> - find the smallest node that encloses a range (BEG . END)
>> - get the beginning and end of a node
>> 
>> Since all other functions use character position (eg, put-text-property, point), using character positions saves lisp code some ‘position-to-bytes’. 
> 
> If you are talking about Lisp, then yes, character positions are a
> much better interface.  But on the C level, sometimes you need byte
> positions, sometimes character positions, and sometimes both.  Since
> you didn't say what level was this about, I cannot say something more
> intelligent.
> 
>> Internally, tree_sitter.c will continue to use byte positions, of course.
> 
> "Internally", as opposed to what?  And what is "internal" in this
> context?  I thought we were talking only about the internals.

By internally I mean C level. I will change lisp interface functions to accept and return character positions, and C level code will keep using byte positions.

I’ll try to make myself clearer next time :-)

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-08-03 12:24                                                                                                           ` Eli Zaretskii
  2021-08-03 13:00                                                                                                             ` Fu Yuan
@ 2021-08-03 13:28                                                                                                             ` Stefan Monnier
  2021-08-03 13:34                                                                                                               ` Eli Zaretskii
  1 sibling, 1 reply; 284+ messages in thread
From: Stefan Monnier @ 2021-08-03 13:28 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Fu Yuan, stephen_leake, cpitclaudel, emacs-devel

> But that's sub-optimal, no?  Imagine a very large buffer which was
> narrowed to a small portion near EOB, then a modification made very
> close to EOB but partially before BEGV, then the buffer widened.  With
> your method, TS will now have to re-parse almost the entire buffer,
> whereas we know it needs to re-parse a very small portion of it.

As a general rule, we will most likely want to work hard to avoid
exposing the narrowed buffer to TS (i.e. most calls to TS will first
`widen`).

Or we will want to keep several parse trees (one per narrowing).

We have the same problem already with `syntax-ppss` which we solve by
keeping two sets of data (`syntax-ppss-wide` and `syntax-ppss-narrow`).


        Stefan




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-08-03 13:28                                                                                                             ` Stefan Monnier
@ 2021-08-03 13:34                                                                                                               ` Eli Zaretskii
  2021-08-06  3:22                                                                                                                 ` Yuan Fu
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-08-03 13:34 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: casouri, stephen_leake, cpitclaudel, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: Fu Yuan <casouri@gmail.com>,  stephen_leake@stephe-leake.org,
>  cpitclaudel@gmail.com,  emacs-devel@gnu.org
> Date: Tue, 03 Aug 2021 09:28:57 -0400
> 
> > But that's sub-optimal, no?  Imagine a very large buffer which was
> > narrowed to a small portion near EOB, then a modification made very
> > close to EOB but partially before BEGV, then the buffer widened.  With
> > your method, TS will now have to re-parse almost the entire buffer,
> > whereas we know it needs to re-parse a very small portion of it.
> 
> As a general rule, we will most likely want to work hard to avoid
> exposing the narrowed buffer to TS (i.e. most calls to TS will first
> `widen`).

Sure.  I was thinking about those corner cases where we won't.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-08-03 13:34                                                                                                               ` Eli Zaretskii
@ 2021-08-06  3:22                                                                                                                 ` Yuan Fu
  2021-08-06  6:37                                                                                                                   ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-08-06  3:22 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: cpitclaudel, stephen_leake, Stefan Monnier, emacs-devel

I’ve added bindings for set_ranges and pushed the latest code to https://github.com/casouri/emacs/tree/ts

As for now, I’ve created bindings for most of the functions I want to expose. Next I’ll probably write some more tests (in addition to responding reviews and comments).

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: How to add pseudo vector types
  2021-08-06  3:22                                                                                                                 ` Yuan Fu
@ 2021-08-06  6:37                                                                                                                   ` Eli Zaretskii
  2021-08-07  5:31                                                                                                                     ` Tree-sitter api (Was: Re: How to add pseudo vector types) Fu Yuan
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-08-06  6:37 UTC (permalink / raw)
  To: Yuan Fu; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Thu, 5 Aug 2021 23:22:17 -0400
> Cc: cpitclaudel@gmail.com, stephen_leake@stephe-leake.org,
>  Stefan Monnier <monnier@iro.umontreal.ca>, emacs-devel@gnu.org
> 
> I’ve added bindings for set_ranges and pushed the latest code to https://github.com/casouri/emacs/tree/ts
> 
> As for now, I’ve created bindings for most of the functions I want to expose. Next I’ll probably write some more tests (in addition to responding reviews and comments).

Thanks.

We should probably start thinking how to integrate TS-related
functionalities into Emacs in general.  E.g., should there be an
option to activate it? should this option be per major mode? something
else?



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Tree-sitter api (Was: Re: How to add pseudo vector types)
  2021-08-06  6:37                                                                                                                   ` Eli Zaretskii
@ 2021-08-07  5:31                                                                                                                     ` Fu Yuan
  2021-08-07  6:26                                                                                                                       ` Eli Zaretskii
  2021-08-07 15:47                                                                                                                       ` Tree-sitter api Stefan Monnier
  0 siblings, 2 replies; 284+ messages in thread
From: Fu Yuan @ 2021-08-07  5:31 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel


> 在 2021年8月6日,上午1:37,Eli Zaretskii <eliz@gnu.org> 写道:
> 
> 
>> 
>> From: Yuan Fu <casouri@gmail.com>
>> Date: Thu, 5 Aug 2021 23:22:17 -0400
>> Cc: cpitclaudel@gmail.com, stephen_leake@stephe-leake.org,
>> Stefan Monnier <monnier@iro.umontreal.ca>, emacs-devel@gnu.org
>> 
>> I’ve added bindings for set_ranges and pushed the latest code to https://github.com/casouri/emacs/tree/ts
>> 
>> As for now, I’ve created bindings for most of the functions I want to expose. Next I’ll probably write some more tests (in addition to responding reviews and comments).
> 
> Thanks.
> 
> We should probably start thinking how to integrate TS-related
> functionalities into Emacs in general.  E.g., should there be an
> option to activate it? should this option be per major mode? something
> else?

We should have a user option to control tree-sitter on major mode level. Maybe an alist where each car is a major node symbol and each cdr is a Boolean value toggling tree-sitter for that node.

We also need tree-sitter-maximum-buffer-size, so that buffer larger than this size won’t enable tree-sitter. (And we need to make sure we never use tree-sitter on buffers larger than 4GB because tree-sitter uses unint32.)

And we can provide a function free-sitter-should-activate-p that computes if we should enable tree-sitter in the current buffer by variables mentioned above, that can be used by major-modes when setting up.

I’m also thinking about having a tree-sitter-defaults that’s analogous to font-lock-defaults, that is set by each major node and used to generate tree-sitter-font-lock-settings.

As for indentation, we could provide some infrastructure like we do for font-locking, or we can just let major modes implement their indent function with tree-sitter api.

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api (Was: Re: How to add pseudo vector types)
  2021-08-07  5:31                                                                                                                     ` Tree-sitter api (Was: Re: How to add pseudo vector types) Fu Yuan
@ 2021-08-07  6:26                                                                                                                       ` Eli Zaretskii
  2021-08-07 15:47                                                                                                                       ` Tree-sitter api Stefan Monnier
  1 sibling, 0 replies; 284+ messages in thread
From: Eli Zaretskii @ 2021-08-07  6:26 UTC (permalink / raw)
  To: Fu Yuan; +Cc: cpitclaudel, stephen_leake, monnier, emacs-devel

> From: Fu Yuan <casouri@gmail.com>
> Date: Sat, 7 Aug 2021 00:31:36 -0500
> Cc: cpitclaudel@gmail.com, stephen_leake@stephe-leake.org,
>  monnier@iro.umontreal.ca, emacs-devel@gnu.org
> 
> > We should probably start thinking how to integrate TS-related
> > functionalities into Emacs in general.  E.g., should there be an
> > option to activate it? should this option be per major mode? something
> > else?
> 
> We should have a user option to control tree-sitter on major mode level. Maybe an alist where each car is a major node symbol and each cdr is a Boolean value toggling tree-sitter for that node.
> 
> We also need tree-sitter-maximum-buffer-size, so that buffer larger than this size won’t enable tree-sitter. (And we need to make sure we never use tree-sitter on buffers larger than 4GB because tree-sitter uses unint32.)
> 
> And we can provide a function free-sitter-should-activate-p that computes if we should enable tree-sitter in the current buffer by variables mentioned above, that can be used by major-modes when setting up.
> 
> I’m also thinking about having a tree-sitter-defaults that’s analogous to font-lock-defaults, that is set by each major node and used to generate tree-sitter-font-lock-settings.
> 
> As for indentation, we could provide some infrastructure like we do for font-locking, or we can just let major modes implement their indent function with tree-sitter api.

SGTM, thanks.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-08-07  5:31                                                                                                                     ` Tree-sitter api (Was: Re: How to add pseudo vector types) Fu Yuan
  2021-08-07  6:26                                                                                                                       ` Eli Zaretskii
@ 2021-08-07 15:47                                                                                                                       ` Stefan Monnier
  2021-08-07 18:40                                                                                                                         ` Theodor Thornhill
  2021-08-08 22:56                                                                                                                         ` Yuan Fu
  1 sibling, 2 replies; 284+ messages in thread
From: Stefan Monnier @ 2021-08-07 15:47 UTC (permalink / raw)
  To: Fu Yuan; +Cc: Eli Zaretskii, cpitclaudel, stephen_leake, emacs-devel

> We should have a user option to control tree-sitter on major mode
> level. Maybe an alist where each car is a major node symbol and each cdr is
> a Boolean value toggling tree-sitter for that node.

The more traditional approach is to use a buffer-local var set by the
major mode or set via (add-hook '<MODE>-hook ...).

> As for indentation, we could provide some infrastructure like we do for
> font-locking, or we can just let major modes implement their indent function
> with tree-sitter api.

We should definitely provide the infrastructure (even if it's fairly
simple) so that major modes only have to provide some rules.


        Stefan




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-08-07 15:47                                                                                                                       ` Tree-sitter api Stefan Monnier
@ 2021-08-07 18:40                                                                                                                         ` Theodor Thornhill
  2021-08-07 19:53                                                                                                                           ` Stefan Monnier
  2021-08-08 22:56                                                                                                                         ` Yuan Fu
  1 sibling, 1 reply; 284+ messages in thread
From: Theodor Thornhill @ 2021-08-07 18:40 UTC (permalink / raw)
  To: Stefan Monnier, Fu Yuan
  Cc: Eli Zaretskii, cpitclaudel, stephen_leake, emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> We should have a user option to control tree-sitter on major mode
>> level. Maybe an alist where each car is a major node symbol and each cdr is
>> a Boolean value toggling tree-sitter for that node.
>
> The more traditional approach is to use a buffer-local var set by the
> major mode or set via (add-hook '<MODE>-hook ...).
>
>> As for indentation, we could provide some infrastructure like we do for
>> font-locking, or we can just let major modes implement their indent function
>> with tree-sitter api.
>
> We should definitely provide the infrastructure (even if it's fairly
> simple) so that major modes only have to provide some rules.
>

Yeah, though that quickly becomes not so simple, considering that
different languages have their own idiosyncrasies with indentation. C#,
for instance, is a rats nest of particularities.  And this is not
considering variations of style guides etc.  It would be nice to get an
api similar to what CC mode has.  Font locking is an easier problem,
since it's just "fontify from node-start to node-end".

I'm not sure how to best provide this api, but I've worked a lot with CC
mode and the new tree-sitter-indent [1].  It quickly gets confusing and
reminds me of `display-buffer`.  Providing both a
`tree-sitter-indent-engine` mode as well as the low level api for major
mode authors would be nice as well.

Providing something too simple would just make people not use it, since
the weirder cases won't be covered.

--
Theo

[1]: https://codeberg.org/FelipeLema/tree-sitter-indent.el



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-08-07 18:40                                                                                                                         ` Theodor Thornhill
@ 2021-08-07 19:53                                                                                                                           ` Stefan Monnier
  2021-08-17  6:18                                                                                                                             ` Yuan Fu
  0 siblings, 1 reply; 284+ messages in thread
From: Stefan Monnier @ 2021-08-07 19:53 UTC (permalink / raw)
  To: Theodor Thornhill
  Cc: Fu Yuan, Eli Zaretskii, cpitclaudel, stephen_leake, emacs-devel

> Yeah, though that quickly becomes not so simple, considering that
> different languages have their own idiosyncrasies with indentation. C#,
> for instance, is a rats nest of particularities.  And this is not
> considering variations of style guides etc.  It would be nice to get an
> api similar to what CC mode has.

I'm thinking of rules specified via a function that takes a TS node
(from which the function can explore the rest of the TS tree) and return
the indentation to use, represented as a pair (POSITION . OFFSET)
(meaning to indent OFFSET columns further than the column position of
POSITION).

The infrastructure would limit itself to making sure we have an uptodate
tree (computed from a properly widened buffer), find the node
corresponding to point pass it to the function and then turn the return
value into an actual column and indent the text accordingly (paying
attention to the usual difference between when point is "within the
indentation" vs "within the text").


        Stefan




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-08-07 15:47                                                                                                                       ` Tree-sitter api Stefan Monnier
  2021-08-07 18:40                                                                                                                         ` Theodor Thornhill
@ 2021-08-08 22:56                                                                                                                         ` Yuan Fu
  2021-08-08 23:24                                                                                                                           ` Stefan Monnier
  1 sibling, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-08-08 22:56 UTC (permalink / raw)
  To: Stefan Monnier
  Cc: Eli Zaretskii, Stephen Leake, Clément Pit-Claudel, emacs-devel



> On Aug 7, 2021, at 10:47 AM, Stefan Monnier <monnier@iro.umontreal.ca> wrote:
> 
>> We should have a user option to control tree-sitter on major mode
>> level. Maybe an alist where each car is a major node symbol and each cdr is
>> a Boolean value toggling tree-sitter for that node.
> 
> The more traditional approach is to use a buffer-local var set by the
> major mode or set via (add-hook '<MODE>-hook ...).

The major-mode would setup things like font-lock-defaults and tree-sitter-defaults before major-mode-hook runs, so I think enabling/disabling tree-sitter in the hook is too late, no?

> 
>> As for indentation, we could provide some infrastructure like we do for
>> font-locking, or we can just let major modes implement their indent function
>> with tree-sitter api.
> 
> We should definitely provide the infrastructure (even if it's fairly
> simple) so that major modes only have to provide some rules.

I don’t really know much about indenting but I’ll try my best. Suggestions are definitely welcome.

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-08-08 22:56                                                                                                                         ` Yuan Fu
@ 2021-08-08 23:24                                                                                                                           ` Stefan Monnier
  2021-08-09  0:06                                                                                                                             ` Yuan Fu
  0 siblings, 1 reply; 284+ messages in thread
From: Stefan Monnier @ 2021-08-08 23:24 UTC (permalink / raw)
  To: Yuan Fu
  Cc: Eli Zaretskii, Clément Pit-Claudel, Stephen Leake, emacs-devel

Yuan Fu [2021-08-08 17:56:33] wrote:
>> On Aug 7, 2021, at 10:47 AM, Stefan Monnier <monnier@iro.umontreal.ca> wrote:
>>> We should have a user option to control tree-sitter on major mode
>>> level. Maybe an alist where each car is a major node symbol and each cdr is
>>> a Boolean value toggling tree-sitter for that node.
>> The more traditional approach is to use a buffer-local var set by the
>> major mode or set via (add-hook '<MODE>-hook ...).
> The major-mode would setup things like font-lock-defaults and
> tree-sitter-defaults before major-mode-hook runs, so I think
> enabling/disabling tree-sitter in the hook is too late, no?

I don't see why.  Presumably the major mode would set some vars relevant
to the tree-sitter support, but then whether those vars are used will
depend on the buffer-local boolean var (let's call it `tree-sitter-mode`).

I'm sure there will be issues w.r.t initialization order, e.g. in case
`font-lock-mode` is enabled before `tree-sitter-mode`, but that doesn't
seem very serious (`font-lock-mode` doesn't do much anyway, since the
real work is postponed until the next redisplay by jit-lock, so we could
"refresh" font-lock settings fairly cheaply within `tree-sitter-mode`).


        Stefan




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-08-08 23:24                                                                                                                           ` Stefan Monnier
@ 2021-08-09  0:06                                                                                                                             ` Yuan Fu
  0 siblings, 0 replies; 284+ messages in thread
From: Yuan Fu @ 2021-08-09  0:06 UTC (permalink / raw)
  To: Stefan Monnier
  Cc: Eli Zaretskii, Stephen Leake, Clément Pit-Claudel, emacs-devel



> On Aug 8, 2021, at 6:24 PM, Stefan Monnier <monnier@iro.umontreal.ca> wrote:
> 
> Yuan Fu [2021-08-08 17:56:33] wrote:
>>> On Aug 7, 2021, at 10:47 AM, Stefan Monnier <monnier@iro.umontreal.ca> wrote:
>>>> We should have a user option to control tree-sitter on major mode
>>>> level. Maybe an alist where each car is a major node symbol and each cdr is
>>>> a Boolean value toggling tree-sitter for that node.
>>> The more traditional approach is to use a buffer-local var set by the
>>> major mode or set via (add-hook '<MODE>-hook ...).
>> The major-mode would setup things like font-lock-defaults and
>> tree-sitter-defaults before major-mode-hook runs, so I think
>> enabling/disabling tree-sitter in the hook is too late, no?
> 
> I don't see why.  Presumably the major mode would set some vars relevant
> to the tree-sitter support, but then whether those vars are used will
> depend on the buffer-local boolean var (let's call it `tree-sitter-mode`).
> 
> I'm sure there will be issues w.r.t initialization order, e.g. in case
> `font-lock-mode` is enabled before `tree-sitter-mode`, but that doesn't
> seem very serious (`font-lock-mode` doesn't do much anyway, since the
> real work is postponed until the next redisplay by jit-lock, so we could
> "refresh" font-lock settings fairly cheaply within `tree-sitter-mode`).

Instead of a tree-sitter-mode, I made font-lock use tree-sitter features in addition to using keywords. Basically I added another fontification pass (tree-sitter pass) in addition to the current two, syntactic pass and regex pass. 
(Syntactic pass is probably unnecessary if tree-sitter is enabled, tho). This way someone can still add regexp-based fontification even he uses tree-sitter for “standard” fontification.

And under this scheme, a major-mode would want something like this in the major-mode definition:

(if (tree-sitter-should-enable-p)
      (progn (setq-local font-lock-tree-sitter-defaults
                         '((ts-c-tree-sitter-settings-1)))
             (setq-local font-lock-defaults
                         (ignore t nil nil nil)))
    (setq-local font-lock-defaults
                '((c-font-lock-keywords
                   c-font-lock-keywords-1
                   c-font-lock-keywords-2
                   c-font-lock-keywords-3)
                  nil nil
                  ((95 . "w")
                   (36 . "w"))
                  c-beginning-of-syntax
                  (font-lock-mark-block-function . c-mark-function))))

In this scheme, changing whether to enable tree-sitter is too late in major-mode-hook (not impossible, of course).

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-08-07 19:53                                                                                                                           ` Stefan Monnier
@ 2021-08-17  6:18                                                                                                                             ` Yuan Fu
  2021-08-18 18:27                                                                                                                               ` Stephen Leake
                                                                                                                                                 ` (2 more replies)
  0 siblings, 3 replies; 284+ messages in thread
From: Yuan Fu @ 2021-08-17  6:18 UTC (permalink / raw)
  To: Stefan Monnier
  Cc: Stephen Leake, Eli Zaretskii, Theodor Thornhill,
	Clément Pit-Claudel, emacs-devel

> 
> I'm thinking of rules specified via a function that takes a TS node
> (from which the function can explore the rest of the TS tree) and return
> the indentation to use, represented as a pair (POSITION . OFFSET)
> (meaning to indent OFFSET columns further than the column position of
> POSITION).
> 
> The infrastructure would limit itself to making sure we have an uptodate
> tree (computed from a properly widened buffer), find the node
> corresponding to point pass it to the function and then turn the return
> value into an actual column and indent the text accordingly (paying
> attention to the usual difference between when point is "within the
> indentation" vs "within the text”).

Okay, here is the (ad-hoc) infrastructure I came up with:

We have a tree-sitter-simple-indent-function. Major-mode authors can set indent-line-function to it to use the simple-indent system. tree-sitter-simple-indent-function indents according to tree-sitter-simple-indent-rules. Doc string of tree-sitter-simple-indent-rules reads:

    A list of indent rule settings.
    Each indent rule setting should be (LANGUAGE . RULES),
    where LANGUAGE is a language symbol, and RULES is a list of
    (MATCHER ANCHOR OFFSET).

    MATCHER determines whether this rule applies, ANCHOR and OFFSET
    together determines which column to indent to.

    A MATCHER is a function that takes three arguments (NODE PARENT
    BOL).  NODE is the largest (highest-in-tree) node starting at
    point.  PARENT is the parent of NODE.  BOL is the point where we
    are indenting: the beginning of line content, the position of the
    first non-whitespace character.

    If MATCHER returns non-nil, meaning the rule matches, Emacs then
    uses ANCHOR to find an anchor, it should be a function that takes
    the same argument (NODE PARENT BOL) and returns a point.

    Finally Emacs computes the column of that point returned by ANCHOR
    and adds OFFSET to it, and indent the line to that column.

    For MATCHER and ANCHOR, Emacs provides some convenient presets.
    See `tree-sitter-simple-indent-presets’.

And doc string for tree-sitter-simple-indent-presets:

    A list of presets.
    These presets can be used as MATHER and ANCHOR in
    `tree-sitter-simple-indent-rules'.

    MATCHER:

    (match NODE-TYPE PARENT-TYPE NODE-FIELD NODE-INDEX-MIN NODE-INDEX-MAX)

        NODE-TYPE checks for node's type, PARENT-TYPE check for
        parent's type, NODE-FIELD checks for the filed name of node
        in the parent, NODE-INDEX-MIN and NODE-INDEX-MAX checks for
        the node's index in the parent.  Therefore, to match the
        first child where parent is \"argument_list\", use (match nil
        \"argument_list\" nil nil 0 0).

    no-node

        Matches the case where node is nil, i.e., there is no node
        that starts at point.  This is the case when indenting an
        empty line.

    (node-at-point TYPE NAMED)

        Check that the node at point -- not the largest node starting at
        point -- has type TYPE.  If NAMED non-nil, check the named node
        at point.

    (parent-is TYPE)

        Check that the parent has type TYPE.

    (node-is TYPE)

        Checks that the node has type TYPE.

    (parent-match PATTERN)

        Checks that the parent matches PATTERN, a query pattern.

    (node-match PATTERN)

        Checks that the node matches PATTERN, a query pattern.

    ANCHOR:

    first-child

        Find the first child of the parent.

    parent

        Find the parent.

    prev-sibling

        Find node's previous sibling.

    no-indent

        Do nothing.

    prev-line

        Find the named node on previous line.  This can be used when
        indenting an empty line: just indent like the previous node.

An example of using these facility can be found in ts-c-tree-sitter-indent-rules.

For example, 

    ((match nil "function_definition" "body") parent 0)

means “match the node which it’s parent’s type is “function_definition” and its field name is “body”, indent to the start of its parent. That indents the starting braces in

int main ()
{
}

    ((parent-is "call_expression") parent 2)

Means “match the node which its’ parent’s type is “call_expression”, and indent to the start of its parent + 2. That indents the second line in

my_cool_function
  (arg1, arg2, arg3)

I’ve implemented some indentation rules for C in ts-c-mode as usual. I expect someone more knowledgeable in C to actually implement it later.

So… do you think this is ok, or convoluted? In particular, is there a better way to implement those “presets”? I don’t want to define them as normal functions, because then their name will be super long (parent-is -> tree-sitter-simple-indent-parent-is) and annoying to use when writing rules, but putting them in an alist (tree-sitter-simple-indent-presets) is a bit ad-hoc. I call these presets with tree-sitter--simple-apply, which basically looks up tree-sitter-simple-indent-presets, get the function and apply it.

You can find the latest version at https://github.com/casouri/emacs/tree/ts
I.e., git clone https://github.com/casouri/emacs.git --branch ts

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-08-17  6:18                                                                                                                             ` Yuan Fu
@ 2021-08-18 18:27                                                                                                                               ` Stephen Leake
  2021-08-18 21:30                                                                                                                                 ` Yuan Fu
  2021-08-23  6:51                                                                                                                                 ` Yuan Fu
  2021-08-22  2:43                                                                                                                               ` Yuan Fu
  2021-08-25  0:21                                                                                                                               ` Stefan Monnier
  2 siblings, 2 replies; 284+ messages in thread
From: Stephen Leake @ 2021-08-18 18:27 UTC (permalink / raw)
  To: Yuan Fu
  Cc: Eli Zaretskii, Theodor Thornhill, Stefan Monnier,
	Clément Pit-Claudel, emacs-devel

This looks very interesting, but I have a migraine right now, so I'll
have to look at it later.

You could try writing indent rules for Ada; current ada-mode code is in
https://savannah.nongnu.org/git/?group=ada-mode. See the test/ directory
for examples of known good indentation.

ada-mode takes the approach of embedding the indent rules directly in
the grammar, and the functions that do that provide a few more options
than yours. To see the definition of those functions, you'll have to
install the wisi package, and look in wisi.info, section Grammar
actions. (it would be nice if that info/html file was linked from the
GNU ELPA package page; I'll start a new thread for that).

Yuan Fu <casouri@gmail.com> writes:

>> 
>> I'm thinking of rules specified via a function that takes a TS node
>> (from which the function can explore the rest of the TS tree) and return
>> the indentation to use, represented as a pair (POSITION . OFFSET)
>> (meaning to indent OFFSET columns further than the column position of
>> POSITION).
>> 
>> The infrastructure would limit itself to making sure we have an uptodate
>> tree (computed from a properly widened buffer), find the node
>> corresponding to point pass it to the function and then turn the return
>> value into an actual column and indent the text accordingly (paying
>> attention to the usual difference between when point is "within the
>> indentation" vs "within the text”).
>
> Okay, here is the (ad-hoc) infrastructure I came up with:
>
> We have a tree-sitter-simple-indent-function. Major-mode authors can set indent-line-function to it to use the simple-indent system. tree-sitter-simple-indent-function indents according to tree-sitter-simple-indent-rules. Doc string of tree-sitter-simple-indent-rules reads:
>
>     A list of indent rule settings.
>     Each indent rule setting should be (LANGUAGE . RULES),
>     where LANGUAGE is a language symbol, and RULES is a list of
>     (MATCHER ANCHOR OFFSET).
>
>     MATCHER determines whether this rule applies, ANCHOR and OFFSET
>     together determines which column to indent to.
>
>     A MATCHER is a function that takes three arguments (NODE PARENT
>     BOL).  NODE is the largest (highest-in-tree) node starting at
>     point.  PARENT is the parent of NODE.  BOL is the point where we
>     are indenting: the beginning of line content, the position of the
>     first non-whitespace character.
>
>     If MATCHER returns non-nil, meaning the rule matches, Emacs then
>     uses ANCHOR to find an anchor, it should be a function that takes
>     the same argument (NODE PARENT BOL) and returns a point.
>
>     Finally Emacs computes the column of that point returned by ANCHOR
>     and adds OFFSET to it, and indent the line to that column.
>
>     For MATCHER and ANCHOR, Emacs provides some convenient presets.
>     See `tree-sitter-simple-indent-presets’.
>
> And doc string for tree-sitter-simple-indent-presets:
>
>     A list of presets.
>     These presets can be used as MATHER and ANCHOR in
>     `tree-sitter-simple-indent-rules'.
>
>     MATCHER:
>
>     (match NODE-TYPE PARENT-TYPE NODE-FIELD NODE-INDEX-MIN NODE-INDEX-MAX)
>
>         NODE-TYPE checks for node's type, PARENT-TYPE check for
>         parent's type, NODE-FIELD checks for the filed name of node
>         in the parent, NODE-INDEX-MIN and NODE-INDEX-MAX checks for
>         the node's index in the parent.  Therefore, to match the
>         first child where parent is \"argument_list\", use (match nil
>         \"argument_list\" nil nil 0 0).
>
>     no-node
>
>         Matches the case where node is nil, i.e., there is no node
>         that starts at point.  This is the case when indenting an
>         empty line.
>
>     (node-at-point TYPE NAMED)
>
>         Check that the node at point -- not the largest node starting at
>         point -- has type TYPE.  If NAMED non-nil, check the named node
>         at point.
>
>     (parent-is TYPE)
>
>         Check that the parent has type TYPE.
>
>     (node-is TYPE)
>
>         Checks that the node has type TYPE.
>
>     (parent-match PATTERN)
>
>         Checks that the parent matches PATTERN, a query pattern.
>
>     (node-match PATTERN)
>
>         Checks that the node matches PATTERN, a query pattern.
>
>     ANCHOR:
>
>     first-child
>
>         Find the first child of the parent.
>
>     parent
>
>         Find the parent.
>
>     prev-sibling
>
>         Find node's previous sibling.
>
>     no-indent
>
>         Do nothing.
>
>     prev-line
>
>         Find the named node on previous line.  This can be used when
>         indenting an empty line: just indent like the previous node.
>
> An example of using these facility can be found in ts-c-tree-sitter-indent-rules.
>
> For example, 
>
>     ((match nil "function_definition" "body") parent 0)
>
> means “match the node which it’s parent’s type is “function_definition” and its field name is “body”, indent to the start of its parent. That indents the starting braces in
>
> int main ()
> {
> }
>
>     ((parent-is "call_expression") parent 2)
>
> Means “match the node which its’ parent’s type is “call_expression”, and indent to the start of its parent + 2. That indents the second line in
>
> my_cool_function
>   (arg1, arg2, arg3)
>
> I’ve implemented some indentation rules for C in ts-c-mode as usual. I expect someone more knowledgeable in C to actually implement it later.
>
> So… do you think this is ok, or convoluted? In particular, is there a better way to implement those “presets”? I don’t want to define them as normal functions, because then their name will be super long (parent-is -> tree-sitter-simple-indent-parent-is) and annoying to use when writing rules, but putting them in an alist (tree-sitter-simple-indent-presets) is a bit ad-hoc. I call these presets with tree-sitter--simple-apply, which basically looks up tree-sitter-simple-indent-presets, get the function and apply it.
>
> You can find the latest version at https://github.com/casouri/emacs/tree/ts
> I.e., git clone https://github.com/casouri/emacs.git --branch ts
>
> Yuan
>
>

-- 
-- Stephe



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-08-18 18:27                                                                                                                               ` Stephen Leake
@ 2021-08-18 21:30                                                                                                                                 ` Yuan Fu
  2021-08-20  0:12                                                                                                                                   ` [SPAM UNSURE] " Stephen Leake
  2021-08-23  6:51                                                                                                                                 ` Yuan Fu
  1 sibling, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-08-18 21:30 UTC (permalink / raw)
  To: Stephen Leake
  Cc: Eli Zaretskii, Theodor Thornhill, Stefan Monnier,
	Clément Pit-Claudel, emacs-devel

> 
> This looks very interesting, but I have a migraine right now, so I'll
> have to look at it later.

Hope you get better soon :-)

> You could try writing indent rules for Ada; current ada-mode code is in
> https://savannah.nongnu.org/git/?group=ada-mode. See the test/ directory
> for examples of known good indentation.
> 
> ada-mode takes the approach of embedding the indent rules directly in
> the grammar, and the functions that do that provide a few more options
> than yours. To see the definition of those functions, you'll have to
> install the wisi package, and look in wisi.info, section Grammar
> actions. (it would be nice if that info/html file was linked from the
> GNU ELPA package page; I'll start a new thread for that).

Thanks. I’ll see what I can do; I know nearly nothing about Ada except that it is commissioned by the department of defense :-)

BTW, while I was reading the manual, I noticed a typo:

If token labels are used in a right hand side, they must be
given explicitly in the indent arguments, using he lisp "cons"
                                               ^
syntax.  Labels are normally only used with EBNF grammars,
which expand into multiple right hand sides, with optional
tokens simply left out.  Explicit labels on the indent
arguments allow them to be left out as well.

Yuan





^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: [SPAM UNSURE] Re: Tree-sitter api
  2021-08-18 21:30                                                                                                                                 ` Yuan Fu
@ 2021-08-20  0:12                                                                                                                                   ` Stephen Leake
  0 siblings, 0 replies; 284+ messages in thread
From: Stephen Leake @ 2021-08-20  0:12 UTC (permalink / raw)
  To: Yuan Fu
  Cc: Eli Zaretskii, Theodor Thornhill, Stefan Monnier,
	Clément Pit-Claudel, emacs-devel

Yuan Fu <casouri@gmail.com> writes:

>> You could try writing indent rules for Ada; current ada-mode code is in
>> https://savannah.nongnu.org/git/?group=ada-mode. See the test/ directory
>> for examples of known good indentation.
>> 
>> ada-mode takes the approach of embedding the indent rules directly in
>> the grammar, and the functions that do that provide a few more options
>> than yours. To see the definition of those functions, you'll have to
>> install the wisi package, and look in wisi.info, section Grammar
>> actions. (it would be nice if that info/html file was linked from the
>> GNU ELPA package page; I'll start a new thread for that).
>
> Thanks. I’ll see what I can do; I know nearly nothing about Ada except
> that it is commissioned by the department of defense :-)

Was, a long time ago. Now it is used by high-security, high-reliability
applications (train control, spacecraft (European, not NASA, sigh), banks).

AdaCore is a company thriving on the business model of selling support
for the Gnu Ada compiler and associated tools.

> BTW, while I was reading the manual, I noticed a typo:

Thanks.

-- 
-- Stephe



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-08-17  6:18                                                                                                                             ` Yuan Fu
  2021-08-18 18:27                                                                                                                               ` Stephen Leake
@ 2021-08-22  2:43                                                                                                                               ` Yuan Fu
  2021-08-22  3:46                                                                                                                                 ` Yuan Fu
  2021-08-22  6:15                                                                                                                                 ` Eli Zaretskii
  2021-08-25  0:21                                                                                                                               ` Stefan Monnier
  2 siblings, 2 replies; 284+ messages in thread
From: Yuan Fu @ 2021-08-22  2:43 UTC (permalink / raw)
  To: Stefan Monnier
  Cc: Stephen Leake, Eli Zaretskii, Theodor Thornhill,
	Clément Pit-Claudel, emacs-devel

I’m trying to automate building the dynamic modules for each language definition. The files are largely identical for each language, I just need to replace language-specific names in each file. Can I use awk?

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-08-22  2:43                                                                                                                               ` Yuan Fu
@ 2021-08-22  3:46                                                                                                                                 ` Yuan Fu
  2021-08-22  6:16                                                                                                                                   ` Eli Zaretskii
  2021-08-22  6:15                                                                                                                                 ` Eli Zaretskii
  1 sibling, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-08-22  3:46 UTC (permalink / raw)
  To: Stefan Monnier
  Cc: Stephen Leake, Eli Zaretskii, Theodor Thornhill,
	Clément Pit-Claudel, emacs-devel



> On Aug 21, 2021, at 7:43 PM, Yuan Fu <casouri@gmail.com> wrote:
> 
> I’m trying to automate building the dynamic modules for each language definition. The files are largely identical for each language, I just need to replace language-specific names in each file. Can I use awk?

Actually, after searching for a bit more, I think what I need is sed. Or there are better tools that I don’t know about? Maybe I can just use emacs --batch?

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-08-22  2:43                                                                                                                               ` Yuan Fu
  2021-08-22  3:46                                                                                                                                 ` Yuan Fu
@ 2021-08-22  6:15                                                                                                                                 ` Eli Zaretskii
  1 sibling, 0 replies; 284+ messages in thread
From: Eli Zaretskii @ 2021-08-22  6:15 UTC (permalink / raw)
  To: Yuan Fu; +Cc: stephen_leake, cpitclaudel, theo, monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Sat, 21 Aug 2021 19:43:31 -0700
> Cc: Theodor Thornhill <theo@thornhill.no>,
>  Eli Zaretskii <eliz@gnu.org>,
>  Clément Pit-Claudel <cpitclaudel@gmail.com>,
>  Stephen Leake <stephen_leake@stephe-leake.org>,
>  emacs-devel <emacs-devel@gnu.org>
> 
> I’m trying to automate building the dynamic modules for each language definition. The files are largely identical for each language, I just need to replace language-specific names in each file. Can I use awk?

Yes.  We already use Awk in a couple of places in the build process.

Another possibility is to use Emacs, if what you need to do is not
part of bootstrap.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-08-22  3:46                                                                                                                                 ` Yuan Fu
@ 2021-08-22  6:16                                                                                                                                   ` Eli Zaretskii
  0 siblings, 0 replies; 284+ messages in thread
From: Eli Zaretskii @ 2021-08-22  6:16 UTC (permalink / raw)
  To: Yuan Fu; +Cc: stephen_leake, cpitclaudel, theo, monnier, emacs-devel

> From: Yuan Fu <casouri@gmail.com>
> Date: Sat, 21 Aug 2021 20:46:45 -0700
> Cc: Theodor Thornhill <theo@thornhill.no>,
>  Eli Zaretskii <eliz@gnu.org>,
>  Clément Pit-Claudel <cpitclaudel@gmail.com>,
>  Stephen Leake <stephen_leake@stephe-leake.org>,
>  emacs-devel <emacs-devel@gnu.org>
> 
> Actually, after searching for a bit more, I think what I need is sed. Or there are better tools that I don’t know about? Maybe I can just use emacs --batch?

Both are possible, but if what you need to do must be done as part of
bootstrap, Emacs might not be available yet.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-08-18 18:27                                                                                                                               ` Stephen Leake
  2021-08-18 21:30                                                                                                                                 ` Yuan Fu
@ 2021-08-23  6:51                                                                                                                                 ` Yuan Fu
  2021-08-24 14:59                                                                                                                                   ` [SPAM UNSURE] " Stephen Leake
  2021-08-24 22:51                                                                                                                                   ` Stefan Monnier
  1 sibling, 2 replies; 284+ messages in thread
From: Yuan Fu @ 2021-08-23  6:51 UTC (permalink / raw)
  To: Stephen Leake
  Cc: Eli Zaretskii, Theodor Thornhill, Stefan Monnier,
	Clément Pit-Claudel, emacs-devel

> 
> ada-mode takes the approach of embedding the indent rules directly in
> the grammar, and the functions that do that provide a few more options
> than yours. To see the definition of those functions, you'll have to
> install the wisi package, and look in wisi.info, section Grammar
> actions. (it would be nice if that info/html file was linked from the
> GNU ELPA package page; I'll start a new thread for that).

I had a cursory look at the manual for indent in wisi and have some questions. Why does wisi indent from “low-level productions”? (I think most indentation engine works line-by-line from the first line.) I don’t know much about how wisi works, but the indentation system seems to stem from circumstances quite different from that of tree-sitter. For example, wiki’s indent is devised alongside the grammar definition, while for tree-sitter, all the hard work of defining grammar is done for me and I’m merely a user of the grammar: that makes indenting with tree-sitter a much simpler job.

A problem I have with smie (and maybe wisi, but I didn’t look into wisi) is its seeming complexity. I’m merely a 22-year-old who drank too much coca-cola, and smie is too complicated for my soaked brain to comprehend. Having a traumatized experience trying to use smie[1], I want to make my indentation system as straightforward as possible. It doesn’t have to be complicated anyway, since it does so much less than wisi and smie. Right now I’d say it’s pretty simple, and most tasks (in indenting C) can be reasonably done, and I imagine difficult cases can be solved by writing custom matcher and anchor functions.

Stefan, can you have a look at tree-sitter-simple-indent? It’s like two messages up? It goes generally along the (pos . offset) idea but has some twists.

[1] Of course, I need to define the grammar when using smie while not when using tree-sitter, so it’s like comparing apple to pears, but I can’t resist finally telling a joke on the list.

Thanks,
Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: [SPAM UNSURE] Re: Tree-sitter api
  2021-08-23  6:51                                                                                                                                 ` Yuan Fu
@ 2021-08-24 14:59                                                                                                                                   ` Stephen Leake
  2021-08-27  5:18                                                                                                                                     ` [SPAM UNSURE] " Yuan Fu
  2021-08-24 22:51                                                                                                                                   ` Stefan Monnier
  1 sibling, 1 reply; 284+ messages in thread
From: Stephen Leake @ 2021-08-24 14:59 UTC (permalink / raw)
  To: Yuan Fu
  Cc: Eli Zaretskii, Theodor Thornhill, Stefan Monnier,
	Clément Pit-Claudel, emacs-devel

Yuan Fu <casouri@gmail.com> writes:

>> 
>> ada-mode takes the approach of embedding the indent rules directly in
>> the grammar, and the functions that do that provide a few more options
>> than yours. To see the definition of those functions, you'll have to
>> install the wisi package, and look in wisi.info, section Grammar
>> actions. (it would be nice if that info/html file was linked from the
>> GNU ELPA package page; I'll start a new thread for that).
>
> I had a cursory look at the manual for indent in wisi and have some
> questions. Why does wisi indent from “low-level productions”? 

The indent of every new-line must be specified; low level productions
can contain new-lines.

> (I think most indentation engine works line-by-line from the first
> line.) I don’t know much about how wisi works, but the indentation
> system seems to stem from circumstances quite different from that of
> tree-sitter. For example, wiki’s indent is devised alongside the
> grammar definition, while for tree-sitter, all the hard work of
> defining grammar is done for me and I’m merely a user of the grammar:
> that makes indenting with tree-sitter a much simpler job.

The Ada grammar is taken from the Ada Reference Manual; the indent
information is added after. The indent information could be in a
separate file, as in tree-sitter (wisitoken does not currently support
this; there would need to be a way to specify which production the
indent rule is associated with).

A tree-sitter based indent engine still has to specify the indent of
every new-line; it's the same amount of information.

Taking the examples from your email:

>     ((match nil "function_definition" "body") parent 0)

> means “match the node which it’s parent’s type is
> “function_definition” and its field name is “body”, indent to the
> start of its parent. That indents the starting braces in

> int main ()
> {
> }

Refering to the tree-sitter-c grammar at
https://github.com/tree-sitter/tree-sitter-c/blob/master/grammar.js,
there is a C grammar production (in tree-sitter syntax):

  function_definition: $ => seq(
      optional($.ms_call_modifier),
      $._declaration_specifiers,
      field('declarator', $._declarator),
      field('body', $.compound_statement)
    ),

In wisitoken syntax, this is:

  function_definition : [ms_call_modifier] declaration_specifiers
    declarator=declarator body=compound_statement

(the current wisi user guide does not define the "=" syntax for
declaring token names, but it is supported; I'll add it to the user
guide)

The indent rule specifies the indent of the field named 'body',
relative to the start of the production. So in wisitoken, this would
specify one component of the indent action for this production:

    {(wisi-indent-action [nil nil nil (body . 0)])}

Presumably there are other rules that specify the indent of the other
tokens in that production, so they would not be 'nil', which in
wisitoken means "undefined"; it is an error for any new-line to have an
undefined indent after all indent actions are applied.

Next example:

    ((parent-is "call_expression") parent 2)

The production is:

 call_expression: $ => prec(PREC.CALL, seq(
      field('function', $._expression),
      field('arguments', $.argument_list)
    )),

In wisitoken syntax (note that wisitoken does not support precedence
declarations (yet)):

 call_expression : function=expression arguments=argument_list
  {(wisi-indent-action [nil (arguments . 2)])}

So your syntax for indent is much more verbose than the wisi syntax
(because each token gets a separate rule), but specifies the same
information.

Your syntax also requires naming each token that is referenced in an
indent rule; wisitoken can use token position to do that, which is the
main reason indent is specified directly in the grammar file; it's very
easy to associate each indent expression with the corresponding token,
without having to make up names for the tokens. Here are the above
wisitoken productions without the token names:

  function_definition : [ms_call_modifier] declaration_specifiers
    declarator compound_statement
    {(wisi-indent-action [nil nil nil 0])}

  call_expression : expression argument_list
    {(wisi-indent-action [nil 2])}

To be fair, we'd have to look at the other types of rules, to see if
this pattern holds up.

I think you were biased by the "matching" rules tree-sitter supports.
That approach is reasonable when you only want to specify information
for a few nodes in the tree. Wisi assumes you want to specify indent
information for most of the nodes in the tree, so it supports a
tree-traversal model instead. Tree-sitter does support tree traversal,
but doesn't provide an easy way to add information for each node, as the
wisi indent-action syntax does.

-- 
-- Stephe



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-08-23  6:51                                                                                                                                 ` Yuan Fu
  2021-08-24 14:59                                                                                                                                   ` [SPAM UNSURE] " Stephen Leake
@ 2021-08-24 22:51                                                                                                                                   ` Stefan Monnier
  1 sibling, 0 replies; 284+ messages in thread
From: Stefan Monnier @ 2021-08-24 22:51 UTC (permalink / raw)
  To: Yuan Fu
  Cc: Stephen Leake, Eli Zaretskii, Theodor Thornhill,
	Clément Pit-Claudel, emacs-devel

> (I think most indentation engine works line-by-line from the
> first line.)

FWIW, the vast majority of the code performing indentation in the
various major modes in Emacs does it by parsing backward from the
position of point and doesn't work "line by line".

The "line by line" is only used for `indent-region` but the workhorse
function is in `indent-line-function` and only performs indentation of
a single line without touching anything else.

IOW, `indent-region` will usually go "line-by-line" but for each line
the actual work will be by parsing backward from that line
(i.e. re-parsing the previous lines that had just been parsed for the
previous line's indentation).  This is obviously not ideal in terms of
efficiency, but in practice indenting a single line usually only needs
to parse a small number of lines (I suspect it's almost O(1) of
*amortized* complexity so in most cases the algorithmic complexity of
`indent-region` is not really affected).

> Stefan, can you have a look at tree-sitter-simple-indent? It’s like two
> messages up? It goes generally along the (pos . offset) idea but has
> some twists.

It's in my todo list, yes.  I'm still backlog'd, tho.


        Stefan




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-08-17  6:18                                                                                                                             ` Yuan Fu
  2021-08-18 18:27                                                                                                                               ` Stephen Leake
  2021-08-22  2:43                                                                                                                               ` Yuan Fu
@ 2021-08-25  0:21                                                                                                                               ` Stefan Monnier
  2021-08-27  5:45                                                                                                                                 ` Yuan Fu
  2 siblings, 1 reply; 284+ messages in thread
From: Stefan Monnier @ 2021-08-25  0:21 UTC (permalink / raw)
  To: Yuan Fu
  Cc: Theodor Thornhill, Eli Zaretskii, Clément Pit-Claudel,
	Stephen Leake, emacs-devel

> Okay, here is the (ad-hoc) infrastructure I came up with:

It's more than what I proposed, but it looks fairly good.
See patch below which is the "side effect" of reading your code.

You'll see that I removed the "-function" from the function name (this
suffix is used for variables holding functions rather than for the
function themselves) and I split that function into two, the outer one
(tree-sitter-indent) implementing basically what I suggested and the
inner one (tree-sitter-simple-indent) implementing the extra structure
you added to it, mediated by a new var `tree-sitter-indent-function`
which modes can set if they want to use another algorithm than the one
you implemented in `tree-sitter-simple-indent`.

The reason why I divided it this way is that my experience with
indentation code is that it can be useful occasionally to call
recursively the indentation code to know where a node *would* be
indented.  This comes in handy when you want to be able to provide
indentation styles like:

    let myvariable = if (foo) {
            bar
          } else {
            baz
          }

where the body of the `if` branches needs to be indented relative to the
position where the `if` itself would be indented if it were on its own line.


        Stefan


PS: The patch also adds some space before open-paren-in-column-0-in-strings
to circumvent some problems with outline-minor-mode incorrectly thinking
those open-parens correspond to actual top-level definitions :-(


diff --git a/lisp/tree-sitter.el b/lisp/tree-sitter.el
index 83aa2d0d123..2c5d103c42d 100644
--- a/lisp/tree-sitter.el
+++ b/lisp/tree-sitter.el
@@ -52,6 +52,8 @@ tree-sitter-should-enable-p
 
 ;;; Parser API supplement
 
+(defvar tree-sitter-parser-list)
+
 (defun tree-sitter-get-parser (language)
   "Find the first parser using LANGUAGE in `tree-sitter-parser-list'."
   (catch 'found
@@ -196,7 +198,7 @@ tree-sitter-simple-indent-rules
   "A list of indent rule settings.
 Each indent rule setting should be (LANGUAGE . RULES),
 where LANGUAGE is a language symbol, and RULES is a list of
-(MATCHER ANCHOR OFFSET).
+  (MATCHER ANCHOR OFFSET).
 
 MATCHER determines whether this rule applies, ANCHOR and OFFSET
 together determines which column to indent to.
@@ -289,7 +291,7 @@ tree-sitter-simple-indent-presets
 
 MATCHER:
 
-(match NODE-TYPE PARENT-TYPE NODE-FIELD NODE-INDEX-MIN NODE-INDEX-MAX)
+  (match NODE-TYPE PARENT-TYPE NODE-FIELD NODE-INDEX-MIN NODE-INDEX-MAX)
 
     NODE-TYPE checks for node's type, PARENT-TYPE check for
     parent's type, NODE-FIELD checks for the filed name of node
@@ -304,25 +306,25 @@ tree-sitter-simple-indent-presets
     that starts at point.  This is the case when indenting an
     empty line.
 
-(node-at-point TYPE NAMED)
+  (node-at-point TYPE NAMED)
 
     Check that the node at point -- not the largest node at
     point, has type TYPE.  If NAMED non-nil, check the named node
     at point.
 
-(parent-is TYPE)
+  (parent-is TYPE)
 
     Check that the parent has type TYPE.
 
-(node-is TYPE)
+  (node-is TYPE)
 
     Checks that the node has type TYPE.
 
-(parent-match PATTERN)
+  (parent-match PATTERN)
 
     Checks that the parent matches PATTERN, a query pattern.
 
-(node-match PATTERN)
+  (node-match PATTERN)
 
     Checks that the node matches PATTERN, a query pattern.
 
@@ -356,7 +358,7 @@ tree-sitter--simple-apply
 
 If FN is a key in `tree-sitter-simple-indent-presets', use the
 corresponding value as the function."
-  (cond ((consp fn)
+  (cond ((consp fn) ;FIXME: This will mis-match for non-compiled lambdas!
          (apply (tree-sitter--simple-apply (car fn) (cdr fn))
                 args))
         ((and (symbolp fn)
@@ -366,21 +368,46 @@ tree-sitter--simple-apply
         ((functionp fn) (apply fn args))
         (t (error "Couldn't find appropriate function for FN"))))
 
-(defun tree-sitter-simple-indent-function ()
+(defvar tree-sitter-indent-function #'tree-sitter-simple-indent
+  "Document.")
+
+(defun tree-sitter-indent ()
   "Indent according to `tree-sitter-simple-indent-rules'."
-  (let* ((orig-pos (point))
-         (bol (save-excursion
+  (pcase-let*
+      ((orig-pos (point))
+       (bol (save-excursion
+              (beginning-of-line)
+              (skip-chars-forward " \t")
+              (point)))
+       (node (tree-sitter-parent-while
+              (cl-loop for parser in tree-sitter-parser-list
+                       for node = (tree-sitter-node-at
+                                   bol nil parser)
+                       if node return node)
+              (lambda (node)
+                (eq bol (tree-sitter-node-start node)))))
+       (parent (tree-sitter-node-parent node))
+       (`(,anchor . ,offset)
+        (funcall tree-sitter-indent-function node parent)))
+    (let ((col (+ (save-excursion
+                    (goto-char anchor)
+                    (current-column))
+                  offset)))
+      (if (< bol orig-pos)
+          (save-excursion
+            (indent-line-to col))
+        (indent-line-to col))
+      (when tree-sitter--indent-verbose
+        (message "indent to %S (%S position + %S)"
+                 col anchor offset)))))
+
+(defun tree-sitter-simple-indent (node parent)
+  (let* ((bol (save-excursion
                 (beginning-of-line)
                 (skip-chars-forward " \t")
                 (point)))
-         (node (tree-sitter-parent-while
-                (cl-loop for parser in tree-sitter-parser-list
-                         for node = (tree-sitter-node-at
-                                     bol nil parser)
-                         if node return node)
-                (lambda (node)
-                  (eq bol (tree-sitter-node-start node)))))
-         (parent (tree-sitter-node-parent node))
+         ;; FIXME: Can't we get the language from `node' rather than
+         ;; from `point'?
          (language (tree-sitter-language-at (point)))
          (rules (alist-get language tree-sitter-simple-indent-rules)))
     (cl-loop for rule in rules
@@ -388,20 +415,9 @@ tree-sitter-simple-indent-function
              for anchor = (nth 1 rule)
              for offset = (nth 2 rule)
              if (tree-sitter--simple-apply pred (list node parent bol))
-             do (let ((col (+ (save-excursion
-                                (goto-char
-                                 (tree-sitter--simple-apply
-                                  anchor (list node parent bol)))
-                                (current-column))
-                              offset)))
-                  (if (< bol orig-pos)
-                      (save-excursion
-                        (indent-line-to col))
-                    (indent-line-to col))
-                  (when tree-sitter--indent-verbose
-                    (message "matched %S\nindent to %s"
-                             pred col)))
-             and return nil)))
+             do `(,(tree-sitter--simple-apply
+                    anchor (list node parent bol))
+                  . ,offset))))
 
 ;;; Lab
 
@@ -435,7 +451,7 @@ ts-c-mode
                   (ignore t nil nil nil)
 
                   indent-line-function
-                  #'tree-sitter-simple-indent-function
+                  #'tree-sitter-indent
 
                   tree-sitter-simple-indent-rules
                   ts-c-tree-sitter-indent-rules)




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: [SPAM UNSURE] Tree-sitter api
  2021-08-24 14:59                                                                                                                                   ` [SPAM UNSURE] " Stephen Leake
@ 2021-08-27  5:18                                                                                                                                     ` Yuan Fu
  2021-08-31  0:48                                                                                                                                       ` Stephen Leake
  0 siblings, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-08-27  5:18 UTC (permalink / raw)
  To: Stephen Leake
  Cc: Eli Zaretskii, Theodor Thornhill, Stefan Monnier,
	Clément Pit-Claudel, emacs-devel

Thank you very much for spending time on this :-)

> On Aug 24, 2021, at 7:59 AM, Stephen Leake <stephen_leake@stephe-leake.org> wrote:
> 
> Yuan Fu <casouri@gmail.com> writes:
> 
>>> 
>>> ada-mode takes the approach of embedding the indent rules directly in
>>> the grammar, and the functions that do that provide a few more options
>>> than yours. To see the definition of those functions, you'll have to
>>> install the wisi package, and look in wisi.info, section Grammar
>>> actions. (it would be nice if that info/html file was linked from the
>>> GNU ELPA package page; I'll start a new thread for that).
>> 
>> I had a cursory look at the manual for indent in wisi and have some
>> questions. Why does wisi indent from “low-level productions”? 
> 
> The indent of every new-line must be specified; low level productions
> can contain new-lines.

Ah, I see, what I did is to find the “largest” node that starts at BOL, and try to match that. IIUC, wisi starts from the “smallest” entity, and goes up (by getting its parent repeatedly) until there is a non-nil indent rule for it?

[snip]

> So your syntax for indent is much more verbose than the wisi syntax
> (because each token gets a separate rule), but specifies the same
> information.
> 
> Your syntax also requires naming each token that is referenced in an
> indent rule; wisitoken can use token position to do that, which is the
> main reason indent is specified directly in the grammar file; it's very
> easy to associate each indent expression with the corresponding token,
> without having to make up names for the tokens.

> Here are the above
> wisitoken productions without the token names:
> 
>  function_definition : [ms_call_modifier] declaration_specifiers
>    declarator compound_statement
>    {(wisi-indent-action [nil nil nil 0])}
> 
>  call_expression : expression argument_list
>    {(wisi-indent-action [nil 2])}
> 
> To be fair, we'd have to look at the other types of rules, to see if
> this pattern holds up.

I tried and all rules can be translated into wisi’s style. However, it ends up as verbose as the previous one. My idea is to write out match patterns (similar to that in wisi) and give names to the interesting ones (so we use names as opposed to position). Then, if any matched node happens to be the node at point, use that node’s corresponding indent rule to indent. And in the indent rule, we can refer to other matched nodes. For example, in the indent rule of list_rest, the anchor is list_first.

Maybe there are better ways to implement this, but at its current stage I don’t think this is better than tree-sitter-simple-indent.

I think part of the reason why wisi’s indent rule can be succinct is that it is written along the grammar definition. It is hard to make tree-sitter’s indent rule as succinct while being easy to understand.

(defvar tree-sitter-query-indent-rules
  '((tree-sitter-c
     "(function_definition body: (_) @body)

(field_declaration_list) @field_decl

(call_expression (_) @call_child)

(if_statement
 (condition) @if_cond
 (consequence) @if_cons
 (alternative) @if_alt
 \"else\" @else)

(switch_statement
 (condition) @switch_cond)

(case_statement
 (_) @case-child) @case

(compound_statement) @lbracket
\"}\" @rbracket

(compound_statement
 . (_) @list_first
 (_)* @list_rest)

(initializer_list
 . (_) @list_first
 (_)* @list_rest)

(argument_list
 . (_) @list_first
 (_)* @list_rest)

(parameter_list
 . (_) @list_first
 (_)* @list_rest)

(field_declaration_list
 . (_) @list_first
 (_)* @list_rest)
"
     (body parent 0)
     (field_decl parent 0)
     (call_child parent 2)
     (if_cond parent 2)
     (if_cons parent 2)
     (if_alt parent 2)
     (switch_cond parent 2)
     (else parent 0)
     (case parent 0)
     (case-child parent 2)
     (lbracket parent 2)
     (rbracket parent 0)
     (list_first parent 2)
     (list_rest list_first 0)))
  "A list of indent rule settings.
Each indent rule setting should be

    (LANGUAGE PATTERN INDENT INDENT...)

where LANGUAGE is a language symbol, PATTERN is a query pattern
string, and each INDENT is a list

    (CAPTURE_NAME ANCHOR OFFSET)

  If a captured node matches
with the node at point, Emacs looks for an INDENT that has a
matching CAPTURE_NAME, and use the ANCHOR and OFFSET of that
INDENT to indent the current line.

ANCHOR should be a capture name, this capture name should capture
another node in PATTERN.  Emacs finds the column of that node,
adds OFFSET to it, and indent the current line to that column.

TODO: examples in manual")

> 
> I think you were biased by the "matching" rules tree-sitter supports.
> That approach is reasonable when you only want to specify information
> for a few nodes in the tree. Wisi assumes you want to specify indent
> information for most of the nodes in the tree, so it supports a
> tree-traversal model instead.

I assumed that the indent rule for most nodes would be something basic, like “same as previous line”, and we only need to specify indent rules for some “special” nodes. 

IIUC, this tree-traversal method that you mentioned is like going bottom-up, and (in tree-sitter terms) match on each level, and accumulate indent delta for each matched indent rule, is that right? Does wisi go all the way up to top-level?

> Tree-sitter does support tree traversal,
> but doesn't provide an easy way to add information for each node, as the
> wisi indent-action syntax does.

Yes, I would still need to use a match pattern and name each node that I want to specify an indent delta for. There is no way to specify indent by position in the match pattern without naming each node.

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-08-25  0:21                                                                                                                               ` Stefan Monnier
@ 2021-08-27  5:45                                                                                                                                 ` Yuan Fu
  2021-09-03 19:16                                                                                                                                   ` Theodor Thornhill
  0 siblings, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-08-27  5:45 UTC (permalink / raw)
  To: Stefan Monnier
  Cc: Stephen Leake, Eli Zaretskii, Theodor Thornhill,
	Clément Pit-Claudel, emacs-devel


> On Aug 24, 2021, at 5:21 PM, Stefan Monnier <monnier@iro.umontreal.ca> wrote:
> 
>> Okay, here is the (ad-hoc) infrastructure I came up with:
> 
> It's more than what I proposed, but it looks fairly good.
> See patch below which is the "side effect" of reading your code.
> 
> You'll see that I removed the "-function" from the function name (this
> suffix is used for variables holding functions rather than for the
> function themselves) and I split that function into two, the outer one
> (tree-sitter-indent) implementing basically what I suggested and the
> inner one (tree-sitter-simple-indent) implementing the extra structure
> you added to it, mediated by a new var `tree-sitter-indent-function`
> which modes can set if they want to use another algorithm than the one
> you implemented in `tree-sitter-simple-indent`.
> 
> The reason why I divided it this way is that my experience with
> indentation code is that it can be useful occasionally to call
> recursively the indentation code to know where a node *would* be
> indented.  This comes in handy when you want to be able to provide
> indentation styles like:
> 
>    let myvariable = if (foo) {
>            bar
>          } else {
>            baz
>          }
> 
> where the body of the `if` branches needs to be indented relative to the
> position where the `if` itself would be indented if it were on its own line.

Thanks, Stefan :-) I applied your patch and fixed the two FIXME’s.

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: [SPAM UNSURE] Tree-sitter api
  2021-08-27  5:18                                                                                                                                     ` [SPAM UNSURE] " Yuan Fu
@ 2021-08-31  0:48                                                                                                                                       ` Stephen Leake
  0 siblings, 0 replies; 284+ messages in thread
From: Stephen Leake @ 2021-08-31  0:48 UTC (permalink / raw)
  To: Yuan Fu
  Cc: Eli Zaretskii, Theodor Thornhill, Stefan Monnier,
	Clément Pit-Claudel, emacs-devel

Yuan Fu <casouri@gmail.com> writes:

> Thank you very much for spending time on this :-)

And thank you for the same; always helpful to have different points of view.

>> The indent of every new-line must be specified; low level productions
>> can contain new-lines.
>
> Ah, I see, what I did is to find the “largest” node that starts at
> BOL, and try to match that. IIUC, wisi starts from the “smallest”
> entity, and goes up (by getting its parent repeatedly) until there is
> a non-nil indent rule for it?

That's almost right. The indent rule for each production is applied
while walking the entire syntax tree in depth-first order.

>> To be fair, we'd have to look at the other types of rules, to see if
>> this pattern holds up.
>
> I tried and all rules can be translated into wisi’s style. 

Ok.

> However, it ends up as verbose as the previous one. My idea is to
> write out match patterns (similar to that in wisi) and give names to
> the interesting ones (so we use names as opposed to position). Then,
> if any matched node happens to be the node at point, use that node’s
> corresponding indent rule to indent. And in the indent rule, we can
> refer to other matched nodes. For example, in the indent rule of
> list_rest, the anchor is list_first.
>
> Maybe there are better ways to implement this, but at its current
> stage I don’t think this is better than tree-sitter-simple-indent.

Ok.

> I think part of the reason why wisi’s indent rule can be succinct is
> that it is written along the grammar definition. It is hard to make
> tree-sitter’s indent rule as succinct while being easy to understand.

Right.

> IIUC, this tree-traversal method that you mentioned is like going
> bottom-up, and (in tree-sitter terms) match on each level, and
> accumulate indent delta for each matched indent rule, is that right?

Yes.

> Does wisi go all the way up to top-level?

Yes; the top-level rule says the indent of every line defaults to 0;
that covers any remaining 'nil' values.

I have not tried to make this part of ada-mode incremental yet (ie, only
visit changed nodes). I'm not sure that's possible.

-- 
-- Stephe



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-08-27  5:45                                                                                                                                 ` Yuan Fu
@ 2021-09-03 19:16                                                                                                                                   ` Theodor Thornhill
       [not found]                                                                                                                                     ` <AF64EB2C-CCEC-4C98-8FE3-37697BEC9098@gmail.com>
  0 siblings, 1 reply; 284+ messages in thread
From: Theodor Thornhill @ 2021-09-03 19:16 UTC (permalink / raw)
  To: Yuan Fu, Stefan Monnier
  Cc: Eli Zaretskii, Clément Pit-Claudel, Stephen Leake, emacs-devel

Yuan Fu <casouri@gmail.com> writes:

Hi!

>
> Thanks, Stefan :-) I applied your patch and fixed the two FIXME’s.
>

If I were to start experimenting with this in csharp-mode, how would I
start?  Right now we support the rust version on melpa, but I'd rather
move to this core-supported package.  How far are we from including this
in core, and what can I do to help?

All the best,
Theodor Thornhill



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
       [not found]                                                                                                                                     ` <AF64EB2C-CCEC-4C98-8FE3-37697BEC9098@gmail.com>
@ 2021-09-04 12:49                                                                                                                                       ` Tuấn-Anh Nguyễn
  2021-09-04 13:04                                                                                                                                         ` Eli Zaretskii
  2021-09-04 15:31                                                                                                                                         ` Yuan Fu
  2021-09-04 15:14                                                                                                                                       ` Tuấn-Anh Nguyễn
  2021-09-05 21:15                                                                                                                                       ` Theodor Thornhill
  2 siblings, 2 replies; 284+ messages in thread
From: Tuấn-Anh Nguyễn @ 2021-09-04 12:49 UTC (permalink / raw)
  To: Yuan Fu
  Cc: Clément Pit-Claudel, Theodor Thornhill, emacs-devel,
	Stefan Monnier, Eli Zaretskii, Stephen Leake

On Sat, Sep 4, 2021 at 1:44 PM Yuan Fu <casouri@gmail.com> wrote:
> 1) tree-sitter lacks a way to change its malloc behavior in run-time, I commented on their road-map issue, but no one has replied yet,
>
Do you mean APIs to change its alloc/free functions at run time? Why would we
need to do that? Doesn't simply defining `ts_malloc` and related functions work?
See https://github.com/tree-sitter/tree-sitter/blob/v0.20.0/lib/src/alloc.h#L27.

> 4) I need to work on a better way to build and distribute language dynamic modules.
>
I think there should be 2 mechanisms:
1. The binaries for common platforms should be built on Emacs's build
   infrastructure, and distributed through GNU ELPA.
2. There should be Lisp functions to download the grammar sources and compile
   them (by invoking the compiler).

> You can find a script for building dynamic modules at https://github.com/casouri/tree-sitter-module
>
I may be missing something here, but the grammars' compiled forms don't need to
be Emacs dynamic modules, right? They only need to be dynamically-loadable
shared libraries.

-- 
Tuấn-Anh Nguyễn
Software Engineer



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-04 12:49                                                                                                                                       ` Tuấn-Anh Nguyễn
@ 2021-09-04 13:04                                                                                                                                         ` Eli Zaretskii
  2021-09-04 14:49                                                                                                                                           ` Tuấn-Anh Nguyễn
  2021-09-04 15:31                                                                                                                                         ` Yuan Fu
  1 sibling, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-09-04 13:04 UTC (permalink / raw)
  To: Tuấn-Anh Nguyễn
  Cc: casouri, theo, cpitclaudel, emacs-devel, monnier, stephen_leake

> From: Tuấn-Anh Nguyễn <ubolonton@gmail.com>
> Date: Sat, 4 Sep 2021 19:49:35 +0700
> Cc: Theodor Thornhill <theo@thornhill.no>, Stephen Leake <stephen_leake@stephe-leake.org>, 
> 	Eli Zaretskii <eliz@gnu.org>, Clément Pit-Claudel <cpitclaudel@gmail.com>, 
> 	Stefan Monnier <monnier@iro.umontreal.ca>, emacs-devel <emacs-devel@gnu.org>
> 
> On Sat, Sep 4, 2021 at 1:44 PM Yuan Fu <casouri@gmail.com> wrote:
> > 1) tree-sitter lacks a way to change its malloc behavior in run-time, I commented on their road-map issue, but no one has replied yet,
> >
> Do you mean APIs to change its alloc/free functions at run time? Why would we
> need to do that?

Because what TS does when it runs out of memory is call 'exit'.
That's unacceptable for Emacs.  Emacs can handle out-of-memory
situations well enough, but it can only do that if the problem is
reported to it by the memory-allocation functions.

> Doesn't simply defining `ts_malloc` and related functions work?

No, because we want to be able to link against a TS library, we don't
want to require people who build Emacs to build TS as well.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-04 13:04                                                                                                                                         ` Eli Zaretskii
@ 2021-09-04 14:49                                                                                                                                           ` Tuấn-Anh Nguyễn
  2021-09-04 15:00                                                                                                                                             ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Tuấn-Anh Nguyễn @ 2021-09-04 14:49 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: Yuan Fu, Theodor Thornhill, Clément Pit-Claudel,
	emacs-devel, Stefan Monnier, Stephen Leake

On Sat, Sep 4, 2021 at 8:04 PM Eli Zaretskii <eliz@gnu.org> wrote:
> No, because we want to be able to link against a TS library, we don't
> want to require people who build Emacs to build TS as well.

Related questions:
1. Who do we expect to build the TS library? For Linux I assume that would be
   the maintainer of the (system) package `libtree-sitter`. Is that correct?
2. Who do we expect to build the grammar binaries?




--
Tuấn-Anh Nguyễn
Software Engineer



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-04 14:49                                                                                                                                           ` Tuấn-Anh Nguyễn
@ 2021-09-04 15:00                                                                                                                                             ` Eli Zaretskii
  2021-09-05 16:34                                                                                                                                               ` Tuấn-Anh Nguyễn
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-09-04 15:00 UTC (permalink / raw)
  To: Tuấn-Anh Nguyễn
  Cc: casouri, theo, cpitclaudel, emacs-devel, monnier, stephen_leake

> From: Tuấn-Anh Nguyễn <ubolonton@gmail.com>
> Date: Sat, 4 Sep 2021 21:49:29 +0700
> Cc: Yuan Fu <casouri@gmail.com>, Theodor Thornhill <theo@thornhill.no>, 
> 	Stephen Leake <stephen_leake@stephe-leake.org>, 
> 	Clément Pit-Claudel <cpitclaudel@gmail.com>, 
> 	Stefan Monnier <monnier@iro.umontreal.ca>, emacs-devel <emacs-devel@gnu.org>
> 
> On Sat, Sep 4, 2021 at 8:04 PM Eli Zaretskii <eliz@gnu.org> wrote:
> > No, because we want to be able to link against a TS library, we don't
> > want to require people who build Emacs to build TS as well.
> 
> Related questions:
> 1. Who do we expect to build the TS library? For Linux I assume that would be
>    the maintainer of the (system) package `libtree-sitter`. Is that correct?

The distro, I'd say.  It can alwso be built on the user's machine and
installed separately.  Basically, the same as with any other optional
library we use: libpng, harfBuzz, etc.

> 2. Who do we expect to build the grammar binaries?

The ones that TS already provides?  They are already built, no?  Or
what do you mean by "build the grammar binaries", what kind of
binaries are those?  Forgive me my ignorance.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
       [not found]                                                                                                                                     ` <AF64EB2C-CCEC-4C98-8FE3-37697BEC9098@gmail.com>
  2021-09-04 12:49                                                                                                                                       ` Tuấn-Anh Nguyễn
@ 2021-09-04 15:14                                                                                                                                       ` Tuấn-Anh Nguyễn
  2021-09-04 15:33                                                                                                                                         ` Eli Zaretskii
  2021-09-04 15:39                                                                                                                                         ` Yuan Fu
  2021-09-05 21:15                                                                                                                                       ` Theodor Thornhill
  2 siblings, 2 replies; 284+ messages in thread
From: Tuấn-Anh Nguyễn @ 2021-09-04 15:14 UTC (permalink / raw)
  To: Yuan Fu
  Cc: Clément Pit-Claudel, Theodor Thornhill, emacs-devel,
	Stefan Monnier, Eli Zaretskii, Stephen Leake

On Sat, Sep 4, 2021 at 1:44 PM Yuan Fu <casouri@gmail.com> wrote:
> You can find the code at https://github.com/casouri/emacs.git, in “ts” branch. As long as tree-sitter library is in the standard path, Emacs will compile with tree-sitter support.

I'm not on a system with `libtree-sitter` available, so I built and installed it
from source:

    export PREFIX=/opt/local
    make
    sudo make install

It installed to `/opt/local/include` and `/opt/local/lib`, which are already on
my standard include/lib paths. However, I'm getting this error:

    configure: error: The following required libraries were not found:
         tree-sitter

Do you have more detailed instructions? For example, should the include/lib
paths/flags be set to some special values?

-- 
Tuấn-Anh Nguyễn
Software Engineer



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-04 12:49                                                                                                                                       ` Tuấn-Anh Nguyễn
  2021-09-04 13:04                                                                                                                                         ` Eli Zaretskii
@ 2021-09-04 15:31                                                                                                                                         ` Yuan Fu
  2021-09-05 16:45                                                                                                                                           ` Tuấn-Anh Nguyễn
  1 sibling, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-09-04 15:31 UTC (permalink / raw)
  To: Tuấn-Anh Nguyễn
  Cc: Clément Pit-Claudel, Theodor Thornhill, emacs-devel,
	Stefan Monnier, Eli Zaretskii, Stephen Leake

> 
>> You can find a script for building dynamic modules at https://github.com/casouri/tree-sitter-module
>> 
> I may be missing something here, but the grammars' compiled forms don't need to
> be Emacs dynamic modules, right? They only need to be dynamically-loadable
> shared libraries.

I packaged language definitions into dynamic modules: the system is there, why not take advantage of it? Do you think this approach can be improved in some way?

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-04 15:14                                                                                                                                       ` Tuấn-Anh Nguyễn
@ 2021-09-04 15:33                                                                                                                                         ` Eli Zaretskii
  2021-09-05 16:48                                                                                                                                           ` Tuấn-Anh Nguyễn
  2021-09-04 15:39                                                                                                                                         ` Yuan Fu
  1 sibling, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-09-04 15:33 UTC (permalink / raw)
  To: Tuấn-Anh Nguyễn
  Cc: casouri, theo, cpitclaudel, emacs-devel, monnier, stephen_leake

> From: Tuấn-Anh Nguyễn <ubolonton@gmail.com>
> Date: Sat, 4 Sep 2021 22:14:06 +0700
> Cc: Clément Pit-Claudel <cpitclaudel@gmail.com>,
>  Theodor Thornhill <theo@thornhill.no>, emacs-devel <emacs-devel@gnu.org>,
>  Stefan Monnier <monnier@iro.umontreal.ca>, Eli Zaretskii <eliz@gnu.org>,
>  Stephen Leake <stephen_leake@stephe-leake.org>
> 
> On Sat, Sep 4, 2021 at 1:44 PM Yuan Fu <casouri@gmail.com> wrote:
> > You can find the code at https://github.com/casouri/emacs.git, in “ts” branch. As long as tree-sitter library is in the standard path, Emacs will compile with tree-sitter support.
> 
> I'm not on a system with `libtree-sitter` available, so I built and installed it
> from source:
> 
>     export PREFIX=/opt/local
>     make
>     sudo make install
> 
> It installed to `/opt/local/include` and `/opt/local/lib`, which are already on
> my standard include/lib paths. However, I'm getting this error:
> 
>     configure: error: The following required libraries were not found:
>          tree-sitter

Does config.log tell anything useful?



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-04 15:14                                                                                                                                       ` Tuấn-Anh Nguyễn
  2021-09-04 15:33                                                                                                                                         ` Eli Zaretskii
@ 2021-09-04 15:39                                                                                                                                         ` Yuan Fu
  1 sibling, 0 replies; 284+ messages in thread
From: Yuan Fu @ 2021-09-04 15:39 UTC (permalink / raw)
  To: Tuấn-Anh Nguyễn
  Cc: Clément Pit-Claudel, Theodor Thornhill, emacs-devel,
	Stefan Monnier, Eli Zaretskii, Stephen Leake



> On Sep 4, 2021, at 8:14 AM, Tuấn-Anh Nguyễn <ubolonton@gmail.com> wrote:
> 
> On Sat, Sep 4, 2021 at 1:44 PM Yuan Fu <casouri@gmail.com> wrote:
>> You can find the code at https://github.com/casouri/emacs.git, in “ts” branch. As long as tree-sitter library is in the standard path, Emacs will compile with tree-sitter support.
> 
> I'm not on a system with `libtree-sitter` available, so I built and installed it
> from source:
> 
>    export PREFIX=/opt/local
>    make
>    sudo make install
> 
> It installed to `/opt/local/include` and `/opt/local/lib`, which are already on
> my standard include/lib paths. However, I'm getting this error:
> 
>    configure: error: The following required libraries were not found:
>         tree-sitter
> 
> Do you have more detailed instructions? For example, should the include/lib
> paths/flags be set to some special values?

Not really, on my machine, tree-sitter is also installed in /opt/local, but I don’t see any problem building Emacs. Maybe give --libdir=DIR and --includedir=DIR a try and see if that helps.

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-04 15:00                                                                                                                                             ` Eli Zaretskii
@ 2021-09-05 16:34                                                                                                                                               ` Tuấn-Anh Nguyễn
  2021-09-05 16:45                                                                                                                                                 ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Tuấn-Anh Nguyễn @ 2021-09-05 16:34 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: Yuan Fu, Theodor Thornhill, Clément Pit-Claudel,
	emacs-devel, Stefan Monnier, Stephen Leake

On Sat, Sep 4, 2021 at 10:00 PM Eli Zaretskii <eliz@gnu.org> wrote:
> > 2. Who do we expect to build the grammar binaries?
>
> The ones that TS already provides?  They are already built, no?  Or
> what do you mean by "build the grammar binaries", what kind of
> binaries are those?  Forgive me my ignorance.

There are 2 components: TS the library (`libtree-sitter`) provides the generic
parts, not the grammars. The grammars come from various repositories, in source
form. (Some of them are owned by the tree-sitter project, some are not.) Each of
those consists of a generated `parser.c` and an optional `scanner.{c,cc}`. They
provide a function `TSLanguage (*tree_sitter_c) ()`, which specifies details on
how to parse a specific language (e.g. the parse table). They are usually
compiled into dynamically-loadable shared libraries (by a `tree-sitter` CLI
program), and distributed separately from `libtree-sitter`.

Tree-sitter has its own ABI versioning for these 2 components. It's easier to
ensure ABI compatibility if they are both built by the same system. That's the
case for GitHub's internal uses of tree-sitter. That's also the case with
`tree-sitter` and `tree-sitter-langs` packages on MELPA. That's not the case
with NeoVim's tree-sitter integration, and in a source of constant headache
AFAICT.

If we leave `libtree-sitter` to the distro, then it also makes sense for the
distro to provide the `tree-sitter` CLI program, and/or the grammar
binaries/sources.

--
Tuấn-Anh Nguyễn
Software Engineer



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-05 16:34                                                                                                                                               ` Tuấn-Anh Nguyễn
@ 2021-09-05 16:45                                                                                                                                                 ` Eli Zaretskii
  0 siblings, 0 replies; 284+ messages in thread
From: Eli Zaretskii @ 2021-09-05 16:45 UTC (permalink / raw)
  To: Tuấn-Anh Nguyễn
  Cc: casouri, theo, cpitclaudel, emacs-devel, monnier, stephen_leake

> From: Tuấn-Anh Nguyễn <ubolonton@gmail.com>
> Date: Sun, 5 Sep 2021 23:34:59 +0700
> Cc: Yuan Fu <casouri@gmail.com>, Theodor Thornhill <theo@thornhill.no>, 
> 	Stephen Leake <stephen_leake@stephe-leake.org>, 
> 	Clément Pit-Claudel <cpitclaudel@gmail.com>, 
> 	Stefan Monnier <monnier@iro.umontreal.ca>, emacs-devel <emacs-devel@gnu.org>
> 
> If we leave `libtree-sitter` to the distro, then it also makes sense for the
> distro to provide the `tree-sitter` CLI program, and/or the grammar
> binaries/sources.

Yes, of course.  And users who built TS themselves, will have to build
those grammar files as well.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-04 15:31                                                                                                                                         ` Yuan Fu
@ 2021-09-05 16:45                                                                                                                                           ` Tuấn-Anh Nguyễn
  2021-09-05 20:19                                                                                                                                             ` Yuan Fu
  0 siblings, 1 reply; 284+ messages in thread
From: Tuấn-Anh Nguyễn @ 2021-09-05 16:45 UTC (permalink / raw)
  To: Yuan Fu
  Cc: Clément Pit-Claudel, Theodor Thornhill, emacs-devel,
	Stefan Monnier, Eli Zaretskii, Stephen Leake

On Sat, Sep 4, 2021 at 10:31 PM Yuan Fu <casouri@gmail.com> wrote:
> I packaged language definitions into dynamic modules: the system is there, why not take advantage of it? Do you think this approach can be improved in some way?

The language definitions just need to come from dynamically-loadable shared
libraries. They don't have to be Emacs dynamic modules, which bring additional
unnecessary complications, e.g. build difficulty, load path pollution, or
inability to load grammar binaries from other sources like distro's package
repos. It's better to just load the shared libs directly without going through
module machinery. Use the functions in `dynlib.h`.



-- 
Tuấn-Anh Nguyễn
Software Engineer



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-04 15:33                                                                                                                                         ` Eli Zaretskii
@ 2021-09-05 16:48                                                                                                                                           ` Tuấn-Anh Nguyễn
  0 siblings, 0 replies; 284+ messages in thread
From: Tuấn-Anh Nguyễn @ 2021-09-05 16:48 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: Yuan Fu, Theodor Thornhill, Clément Pit-Claudel,
	emacs-devel, Stefan Monnier, Stephen Leake

On Sat, Sep 4, 2021 at 10:33 PM Eli Zaretskii <eliz@gnu.org> wrote:
> Does config.log tell anything useful?

Yeah, it showed that `pkg-config` could not find `tree-sitter`. It was a problem
with my custom setup. I can build it now. Thanks!

-- 
Tuấn-Anh Nguyễn
Software Engineer



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-05 16:45                                                                                                                                           ` Tuấn-Anh Nguyễn
@ 2021-09-05 20:19                                                                                                                                             ` Yuan Fu
  2021-09-06  0:03                                                                                                                                               ` Tuấn-Anh Nguyễn
  0 siblings, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-09-05 20:19 UTC (permalink / raw)
  To: Tuấn-Anh Nguyễn
  Cc: Clément Pit-Claudel, Theodor Thornhill, emacs-devel,
	Stefan Monnier, Eli Zaretskii, Stephen Leake



> On Sep 5, 2021, at 9:45 AM, Tuấn-Anh Nguyễn <ubolonton@gmail.com> wrote:
> 
> On Sat, Sep 4, 2021 at 10:31 PM Yuan Fu <casouri@gmail.com> wrote:
>> I packaged language definitions into dynamic modules: the system is there, why not take advantage of it? Do you think this approach can be improved in some way?
> 
> The language definitions just need to come from dynamically-loadable shared
> libraries. They don't have to be Emacs dynamic modules, which bring additional
> unnecessary complications, e.g. build difficulty, load path pollution, or
> inability to load grammar binaries from other sources like distro's package
> repos. It's better to just load the shared libs directly without going through
> module machinery. Use the functions in `dynlib.h`.
> 

Dynamic modules comes with nice things, for example Emacs looks for them automatically in load-path; Emacs reports errors with it has problem loading one; I can package some additional information with the module; I could maybe distribute them through ordinary package.el facility, etc etc. If I load the shared library directly, I need to reinvent the wheels for loading, error reporting, searching in load-path, and others. On the other hand, dynamic modules don’t come with much complications. Yes, you need additional emacs-modules.h and tree-sitter-<lang>.c to build it, but that’s about it. And I was hoping to distribute pre-built modules anyway, so if all went well, ordinary users don’t need to compile the modules. WDYT?

P.S. what do you mean by “load path pollution”?
P.P.S. My impression is that other applications distribute language definitions by themselves, and it is not common for distort to package language definitions, is that correct?

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
       [not found]                                                                                                                                     ` <AF64EB2C-CCEC-4C98-8FE3-37697BEC9098@gmail.com>
  2021-09-04 12:49                                                                                                                                       ` Tuấn-Anh Nguyễn
  2021-09-04 15:14                                                                                                                                       ` Tuấn-Anh Nguyễn
@ 2021-09-05 21:15                                                                                                                                       ` Theodor Thornhill
  2021-09-05 23:58                                                                                                                                         ` Yuan Fu
  2 siblings, 1 reply; 284+ messages in thread
From: Theodor Thornhill @ 2021-09-05 21:15 UTC (permalink / raw)
  To: Yuan Fu
  Cc: Stefan Monnier, Eli Zaretskii, Clément Pit-Claudel,
	Stephen Leake, emacs-devel

Yuan Fu <casouri@gmail.com> writes:

> You can find the code at https://github.com/casouri/emacs.git, in “ts” branch. As long as tree-sitter library is in the standard path, Emacs will compile with tree-sitter support. Language definitions are loaded by dynamic modules. You can find a script for building dynamic modules at https://github.com/casouri/tree-sitter-module, you can even just grab the release file. I need to add C# to the list of languages in the build script.
>
> Now you have Emacs and dynamic modules. You can also build the manual, I just wrote the manual entries for tree-sitter API, its a draft but should explain everything the Emacs tree-sitter API provides. To build the manual you want to go to /doc/listpref, and do “make -e HTML_OPTS="--html” elisp.html”, that should compile a manual in elisp.html directory. The tree-sitter part is in “37 Parsing Program Source”. I attached a zip file containing the compiled manual on my machine, you can just use that.
>
> I haven’t written manual for font-lock and indent support, because they are not settled yet. To see how they work, you can read:
>
> 1. the source of ts-c-mode in /lisp/tree-sitter.el,
> 2. doctoring of font-lock-tree-sitter-defaults and font-lock-tree-sitter-settings, and 
> 3. docstring of tree-sitter-simple-indent-rules
>
> BTW, tree-sitter-inspect-mode could be helpful.
>
>> Right now we support the rust version on melpa, but I'd rather
>> move to this core-supported package.  How far are we from including this
>> in core,
>
> Some blockers that I can think of are 1) tree-sitter lacks a way to change its malloc behavior in run-time, I commented on their road-map issue, but no one has replied yet, 2) font-lock and indent support hasn’t settled, 3) writing, reviewing and editing the manual will take some time, 4) I need to work on a better way to build and distribute language dynamic modules.
>
>> and what can I do to help?
>
> For a starter, could you perhaps have a look at the indentation system (tree-sitter-simple-indent-rules and friends), and tell me if anything is lacking? Too complex, not powerful enough, etc. Do you have any suggestions? The same goes for font-lock.
>
> Also, I’m happy to hear your suggestions on the general tree-sitter API and the manual.
>


Thank you for your thorough instructions.  I've been able to compile it
on my system, but I'm having trouble with the c-sharp module.  I get
this error:

--------------------------------------------

Cloning into 'tree-sitter-c-sharp'...
remote: Enumerating objects: 62, done.
remote: Counting objects: 100% (62/62), done.
remote: Compressing objects: 100% (57/57), done.
remote: Total 62 (delta 17), reused 19 (delta 0), pack-reused 0
Receiving objects: 100% (62/62), 831.92 KiB | 2.12 MiB/s, done.
Resolving deltas: 100% (17/17), done.
tree-sitter-c-sharp.c:7:33: error: expected ';' after top level declarator
extern TSLanguage *tree_sitter_c-sharp(void);
                                ^
                                ;
tree-sitter-c-sharp.c:16:40: error: implicit declaration of function 'sharp' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
  TSLanguage *language = tree_sitter_c-sharp();
                                       ^
2 errors generated.
binding.cc:2:10: fatal error: 'node.h' file not found
#include <node.h>
         ^~~~~~~~
1 error generated.

----------------------------------------------

I'm guessing this is due to the hyphen in the function name.  I remember
we had to do some shenanigans in the rust variant some time ago to
translate this properly.  In the C files we need to use underscore
rather than hyphen, yes?  If this isn't too hard to do I guess I can try
to make a PR to your project, otherwise you at least have a bugreport
here :)  I'm also looking into the code now, and it looks nice so far.
I'll come back to you when I have something more!

Theodor



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-05 21:15                                                                                                                                       ` Theodor Thornhill
@ 2021-09-05 23:58                                                                                                                                         ` Yuan Fu
  0 siblings, 0 replies; 284+ messages in thread
From: Yuan Fu @ 2021-09-05 23:58 UTC (permalink / raw)
  To: Theodor Thornhill
  Cc: Stephen Leake, Eli Zaretskii, Clément Pit-Claudel,
	Stefan Monnier, emacs-devel

> 
> Thank you for your thorough instructions.  I've been able to compile it
> on my system, but I'm having trouble with the c-sharp module.  I get
> this error:
> 
> --------------------------------------------
> 
> Cloning into 'tree-sitter-c-sharp'...
> remote: Enumerating objects: 62, done.
> remote: Counting objects: 100% (62/62), done.
> remote: Compressing objects: 100% (57/57), done.
> remote: Total 62 (delta 17), reused 19 (delta 0), pack-reused 0
> Receiving objects: 100% (62/62), 831.92 KiB | 2.12 MiB/s, done.
> Resolving deltas: 100% (17/17), done.
> tree-sitter-c-sharp.c:7:33: error: expected ';' after top level declarator
> extern TSLanguage *tree_sitter_c-sharp(void);
>                                ^
>                                ;
> tree-sitter-c-sharp.c:16:40: error: implicit declaration of function 'sharp' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
>  TSLanguage *language = tree_sitter_c-sharp();
>                                       ^
> 2 errors generated.
> binding.cc:2:10: fatal error: 'node.h' file not found
> #include <node.h>
>         ^~~~~~~~
> 1 error generated.
> 
> ----------------------------------------------
> 
> I'm guessing this is due to the hyphen in the function name.  I remember
> we had to do some shenanigans in the rust variant some time ago to
> translate this properly.  In the C files we need to use underscore
> rather than hyphen, yes?  If this isn't too hard to do I guess I can try
> to make a PR to your project, otherwise you at least have a bugreport
> here :)  I'm also looking into the code now, and it looks nice so far.
> I'll come back to you when I have something more!

Thanks for trying out and reporting :-) I’ve fixed the build script and it now should build c-sharp.

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-05 20:19                                                                                                                                             ` Yuan Fu
@ 2021-09-06  0:03                                                                                                                                               ` Tuấn-Anh Nguyễn
  2021-09-06  0:23                                                                                                                                                 ` Yuan Fu
  0 siblings, 1 reply; 284+ messages in thread
From: Tuấn-Anh Nguyễn @ 2021-09-06  0:03 UTC (permalink / raw)
  To: Yuan Fu
  Cc: Clément Pit-Claudel, Theodor Thornhill, emacs-devel,
	Stefan Monnier, Eli Zaretskii, Stephen Leake

On Mon, Sep 6, 2021 at 3:19 AM Yuan Fu <casouri@gmail.com> wrote:
> Dynamic modules comes with nice things, for example Emacs looks for them automatically in load-path; Emacs reports errors with it has problem loading one; On the other hand, dynamic modules don’t come with much complications. Yes, you need additional emacs-modules.h and tree-sitter-<lang>.c to build it, but that’s about it.

See my other discussion with Eli. We want to rely on the distro to provide the
binaries and the `tree-sitter` CLI program, and to be able to use shared libs
from other sources as well (like self-built). They are not going to be Emacs
dynamic modules.

> I can package some additional information with the module; I could maybe distribute them through ordinary package.el facility, etc etc.

Neither of these requires it to be a module at all. (Also note that package.el
isn't able to handle platform-specific files at the moment.)

> If I load the shared library directly, I need to reinvent the wheels for loading, error reporting, searching in load-path, and others.

The non-module-specific part of loading is provided by `dynlib.h`. There's no
wheel to reinvent here. What error reporting do you mean? (You are going to need
additional checks for ABI compatibility anyway.) Searching a load path (not the
`load-path`) is not that complicated. What are the others?

> And I was hoping to distribute pre-built modules anyway, so if all went well, ordinary users don’t need to compile the modules. WDYT?

It's good to provide that convenience, but it should not be at the expense of
not being able to use binaries from other sources, or to build the binaries on
their own. The `tree-sitter-langs` package already enables both of these. It
provides both pre-built binaries and functions for users to compile on their
own. And it does so without putting language definitions in dynamic modules.

> P.S. what do you mean by “load path pollution”?

I meant to say load path collision, but since you use `tree-sitter-{lang}` for
the module name, that's less of a problem. Load path pollution is these names
showing up when the user enumerates entries on the load path trying to go to the
source of a Lisp library. That's annoying, but bearable.

> P.P.S. My impression is that other applications distribute language definitions by themselves, and it is not common for distort to package language definitions, is that correct?

I don't understand this. Can you rephrase it?

All in all, you are severely underestimating the amount of complexity and wheels
you will have to reinvent in other places compared to the amount of code you
don't have to write by requiring language definitions to be in dynamic modules.
(It's less than 100, most of which is docstrings and comments.)

--
Tuấn-Anh Nguyễn
Software Engineer



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-06  0:03                                                                                                                                               ` Tuấn-Anh Nguyễn
@ 2021-09-06  0:23                                                                                                                                                 ` Yuan Fu
  2021-09-06  5:33                                                                                                                                                   ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-09-06  0:23 UTC (permalink / raw)
  To: Tuấn-Anh Nguyễn
  Cc: Clément Pit-Claudel, Theodor Thornhill, emacs-devel,
	Stefan Monnier, Eli Zaretskii, Stephen Leake



> On Sep 5, 2021, at 5:03 PM, Tuấn-Anh Nguyễn <ubolonton@gmail.com> wrote:
> 
> On Mon, Sep 6, 2021 at 3:19 AM Yuan Fu <casouri@gmail.com> wrote:
>> Dynamic modules comes with nice things, for example Emacs looks for them automatically in load-path; Emacs reports errors with it has problem loading one; On the other hand, dynamic modules don’t come with much complications. Yes, you need additional emacs-modules.h and tree-sitter-<lang>.c to build it, but that’s about it.
> 
> See my other discussion with Eli. We want to rely on the distro to provide the
> binaries and the `tree-sitter` CLI program, and to be able to use shared libs
> from other sources as well (like self-built). They are not going to be Emacs
> dynamic modules.
> 
>> I can package some additional information with the module; I could maybe distribute them through ordinary package.el facility, etc etc.
> 
> Neither of these requires it to be a module at all. (Also note that package.el
> isn't able to handle platform-specific files at the moment.)
> 
>> If I load the shared library directly, I need to reinvent the wheels for loading, error reporting, searching in load-path, and others.
> 
> The non-module-specific part of loading is provided by `dynlib.h`. There's no
> wheel to reinvent here. What error reporting do you mean? (You are going to need
> additional checks for ABI compatibility anyway.) Searching a load path (not the
> `load-path`) is not that complicated. What are the others?
> 
>> And I was hoping to distribute pre-built modules anyway, so if all went well, ordinary users don’t need to compile the modules. WDYT?
> 
> It's good to provide that convenience, but it should not be at the expense of
> not being able to use binaries from other sources, or to build the binaries on
> their own. The `tree-sitter-langs` package already enables both of these. It
> provides both pre-built binaries and functions for users to compile on their
> own. And it does so without putting language definitions in dynamic modules.
> 
>> P.S. what do you mean by “load path pollution”?
> 
> I meant to say load path collision, but since you use `tree-sitter-{lang}` for
> the module name, that's less of a problem. Load path pollution is these names
> showing up when the user enumerates entries on the load path trying to go to the
> source of a Lisp library. That's annoying, but bearable.
> 
>> P.P.S. My impression is that other applications distribute language definitions by themselves, and it is not common for distort to package language definitions, is that correct?
> 
> I don't understand this. Can you rephrase it?
> 
> All in all, you are severely underestimating the amount of complexity and wheels
> you will have to reinvent in other places compared to the amount of code you
> don't have to write by requiring language definitions to be in dynamic modules.
> (It's less than 100, most of which is docstrings and comments.)

I see your point. If no one else object, I’ll change the code to use shared libraries instead of dynamic modules. Thanks for the input :-)

Yuan




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-06  0:23                                                                                                                                                 ` Yuan Fu
@ 2021-09-06  5:33                                                                                                                                                   ` Eli Zaretskii
  2021-09-07 15:38                                                                                                                                                     ` Tuấn-Anh Nguyễn
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-09-06  5:33 UTC (permalink / raw)
  To: Yuan Fu; +Cc: ubolonton, theo, cpitclaudel, emacs-devel, monnier, stephen_leake

> From: Yuan Fu <casouri@gmail.com>
> Date: Sun, 5 Sep 2021 17:23:33 -0700
> Cc: Theodor Thornhill <theo@thornhill.no>,
>  Stephen Leake <stephen_leake@stephe-leake.org>,
>  Eli Zaretskii <eliz@gnu.org>,
>  Clément Pit-Claudel <cpitclaudel@gmail.com>,
>  Stefan Monnier <monnier@iro.umontreal.ca>,
>  emacs-devel <emacs-devel@gnu.org>
> 
> I see your point. If no one else object, I’ll change the code to use shared libraries instead of dynamic modules. Thanks for the input :-)

Can we please stop for a moment and describe what exactly is required
for loading a language module?  I think it would be good to have that
documented in this discussion for posterity, and so that we make sure
we are all on the same page.

I understand that a language module gets compiled into a shared
library, either as part of building TS or separately.  But what should
Emacs do to "load" the module, and when should it do that?  And how do
we intend to handle the situation where a module is needed, but is not
available (i.e. its loading fails)?

Emacs has a load-on-demand infrastructure for shared libraries, but it
only exists on MS-Windows, where we support installations of Emacs
binaries without some of the optional libraries, and want to handle
that gracefully.  However, this doesn't seem to be a similar
situation; for starters, load-on-demand needs to know at Emacs build
time the names of entry points (functions and variables) we need to
import from each shared library.  So I guess we are talking about some
(slightly) different mechanism here?

Thanks.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-06  5:33                                                                                                                                                   ` Eli Zaretskii
@ 2021-09-07 15:38                                                                                                                                                     ` Tuấn-Anh Nguyễn
  2021-09-07 16:16                                                                                                                                                       ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Tuấn-Anh Nguyễn @ 2021-09-07 15:38 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: Yuan Fu, Theodor Thornhill, Clément Pit-Claudel,
	emacs-devel, Stefan Monnier, Stephen Leake

On Mon, Sep 6, 2021 at 12:33 PM Eli Zaretskii <eliz@gnu.org> wrote:
> I understand that a language module gets compiled into a shared
> library, either as part of building TS or separately.  But what should
> Emacs do to "load" the module, and when should it do that?  And how do
> we intend to handle the situation where a module is needed, but is not
> available (i.e. its loading fails)?

Emacs should "load" the module when it's asked to do so, by a function, e.g.
`tree-sitter-load-lang`. When loading fails, it should signal an error.

To locate the module, I think there are 2 possible approaches:
1. Emacs consults a new search path variable to look for the module, which is
  named `<lang>[.ext]`, and calls `dynlib_open` with the absolute path.
2. Emacs calls `dynlib_open` with the basename `tree-sitter-<lang>[.ext]`,
  relying on the module being correctly put on the system's library search path,
  e.g. by the distro's package manager.

Option 2 sounds better to me, but option 1 is how people do it at the moment.
(And no distro has packaged these AFAICT.)

> Emacs has a load-on-demand infrastructure for shared libraries, but it
> only exists on MS-Windows, where we support installations of Emacs
> binaries without some of the optional libraries, and want to handle
> that gracefully.  However, this doesn't seem to be a similar
> situation; for starters, load-on-demand needs to know at Emacs build
> time the names of entry points (functions and variables) we need to
> import from each shared library.  So I guess we are talking about some
> (slightly) different mechanism here?

For each language, the entry point is a single function `TSLanguage
(*tree_sitter_<lang>) ()`,
where `lang` is the name declared in the grammar's DSL source. It's ensured by
the parser generator (the `tree-sitter` CLI program).

-- 
Tuấn-Anh Nguyễn
Software Engineer



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-07 15:38                                                                                                                                                     ` Tuấn-Anh Nguyễn
@ 2021-09-07 16:16                                                                                                                                                       ` Eli Zaretskii
  2021-09-08  3:06                                                                                                                                                         ` Yuan Fu
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-09-07 16:16 UTC (permalink / raw)
  To: Tuấn-Anh Nguyễn
  Cc: casouri, theo, cpitclaudel, emacs-devel, monnier, stephen_leake

> From: Tuấn-Anh Nguyễn <ubolonton@gmail.com>
> Date: Tue, 7 Sep 2021 22:38:52 +0700
> Cc: Yuan Fu <casouri@gmail.com>, Theodor Thornhill <theo@thornhill.no>, 
> 	Stephen Leake <stephen_leake@stephe-leake.org>, 
> 	Clément Pit-Claudel <cpitclaudel@gmail.com>, 
> 	Stefan Monnier <monnier@iro.umontreal.ca>, emacs-devel <emacs-devel@gnu.org>
> 
> On Mon, Sep 6, 2021 at 12:33 PM Eli Zaretskii <eliz@gnu.org> wrote:
> > I understand that a language module gets compiled into a shared
> > library, either as part of building TS or separately.  But what should
> > Emacs do to "load" the module, and when should it do that?  And how do
> > we intend to handle the situation where a module is needed, but is not
> > available (i.e. its loading fails)?
> 
> Emacs should "load" the module when it's asked to do so, by a function, e.g.
> `tree-sitter-load-lang`. When loading fails, it should signal an error.

So this has to be an explicit load initiated by a Lisp program?  How
would that program know which module to load for a given language?  (I
thought TS would load the module it needs whenever support for a
language is requested.)

> To locate the module, I think there are 2 possible approaches:
> 1. Emacs consults a new search path variable to look for the module, which is
>   named `<lang>[.ext]`, and calls `dynlib_open` with the absolute path.
> 2. Emacs calls `dynlib_open` with the basename `tree-sitter-<lang>[.ext]`,
>   relying on the module being correctly put on the system's library search path,
>   e.g. by the distro's package manager.
> 
> Option 2 sounds better to me, but option 1 is how people do it at the moment.
> (And no distro has packaged these AFAICT.)

I think 2 is better, since we are relying on others to build and
package these modules.

Thanks.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-07 16:16                                                                                                                                                       ` Eli Zaretskii
@ 2021-09-08  3:06                                                                                                                                                         ` Yuan Fu
  2021-09-10  2:06                                                                                                                                                           ` Yuan Fu
  0 siblings, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-09-08  3:06 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: Tuấn-Anh Nguyễn, Theodor Thornhill,
	Clément Pit-Claudel, emacs-devel, Stefan Monnier,
	Stephen Leake

>> 
>> Emacs should "load" the module when it's asked to do so, by a function, e.g.
>> `tree-sitter-load-lang`. When loading fails, it should signal an error.
> 
> So this has to be an explicit load initiated by a Lisp program?  How
> would that program know which module to load for a given language?  (I
> thought TS would load the module it needs whenever support for a
> language is requested.)

TS doesn’t load the module, it expects the user to pass it a pointer to the language definition. How does the user get the language definition is not its business. The user is supposed to combine TS and a language definition to create a workable parser. See:

bool ts_parser_set_language(TSParser *self, const TSLanguage *language);

TS only wants a pointer to a TSLanguage.

All the language modules have regular names, i.e., tree-sitter-<lang>.so, so I think we can just calculate the name from the language name; or we can add a backup: use an alist to map language names to module names to cover possible irregular names.

Yuan




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-08  3:06                                                                                                                                                         ` Yuan Fu
@ 2021-09-10  2:06                                                                                                                                                           ` Yuan Fu
  2021-09-10  6:32                                                                                                                                                             ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-09-10  2:06 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: Tuấn-Anh Nguyễn, Theodor Thornhill,
	Clément Pit-Claudel, emacs-devel, Stefan Monnier,
	Stephen Leake



> On Sep 7, 2021, at 8:06 PM, Yuan Fu <casouri@gmail.com> wrote:
> 
>>> 
>>> Emacs should "load" the module when it's asked to do so, by a function, e.g.
>>> `tree-sitter-load-lang`. When loading fails, it should signal an error.
>> 
>> So this has to be an explicit load initiated by a Lisp program?  How
>> would that program know which module to load for a given language?  (I
>> thought TS would load the module it needs whenever support for a
>> language is requested.)
> 
> TS doesn’t load the module, it expects the user to pass it a pointer to the language definition. How does the user get the language definition is not its business. The user is supposed to combine TS and a language definition to create a workable parser. See:
> 
> bool ts_parser_set_language(TSParser *self, const TSLanguage *language);
> 
> TS only wants a pointer to a TSLanguage.
> 
> All the language modules have regular names, i.e., tree-sitter-<lang>.so, so I think we can just calculate the name from the language name; or we can add a backup: use an alist to map language names to module names to cover possible irregular names.

If you think it’s fine, Eli, I’ll start working on this.

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-10  2:06                                                                                                                                                           ` Yuan Fu
@ 2021-09-10  6:32                                                                                                                                                             ` Eli Zaretskii
  2021-09-10 19:57                                                                                                                                                               ` Yuan Fu
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-09-10  6:32 UTC (permalink / raw)
  To: Yuan Fu; +Cc: ubolonton, theo, cpitclaudel, emacs-devel, monnier, stephen_leake

> From: Yuan Fu <casouri@gmail.com>
> Date: Thu, 9 Sep 2021 19:06:28 -0700
> Cc: Tuấn-Anh Nguyễn <ubolonton@gmail.com>,
>  Theodor Thornhill <theo@thornhill.no>,
>  Stephen Leake <stephen_leake@stephe-leake.org>,
>  Clément Pit-Claudel <cpitclaudel@gmail.com>,
>  Stefan Monnier <monnier@iro.umontreal.ca>,
>  emacs-devel@gnu.org
> 
> > TS doesn’t load the module, it expects the user to pass it a pointer to the language definition. How does the user get the language definition is not its business. The user is supposed to combine TS and a language definition to create a workable parser. See:
> > 
> > bool ts_parser_set_language(TSParser *self, const TSLanguage *language);
> > 
> > TS only wants a pointer to a TSLanguage.
> > 
> > All the language modules have regular names, i.e., tree-sitter-<lang>.so, so I think we can just calculate the name from the language name; or we can add a backup: use an alist to map language names to module names to cover possible irregular names.
> 
> If you think it’s fine, Eli, I’ll start working on this.

Sure.  I guess we will have to have a database of module names for
each programming language somewhere?



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-10  6:32                                                                                                                                                             ` Eli Zaretskii
@ 2021-09-10 19:57                                                                                                                                                               ` Yuan Fu
  2021-09-11  3:41                                                                                                                                                                 ` Tuấn-Anh Nguyễn
  2021-09-11  5:51                                                                                                                                                                 ` Eli Zaretskii
  0 siblings, 2 replies; 284+ messages in thread
From: Yuan Fu @ 2021-09-10 19:57 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: Tuấn-Anh Nguyễn, Theodor Thornhill,
	Clément Pit-Claudel, emacs-devel, Stefan Monnier,
	Stephen Leake

> 
> Sure.  I guess we will have to have a database of module names for
> each programming language somewhere?

My plan is to translate lisp names to C names by default, and have an override list for irregular names that can’t be translated correctly.

Just realized another problem, how do we make sure the loaded library is GPL-compatible? There certainly won’t be “plugin_is_GPL_compatible” symbol in them… And IIUC Emacs cannot load GPL-incompatible dynamic libraries?

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-10 19:57                                                                                                                                                               ` Yuan Fu
@ 2021-09-11  3:41                                                                                                                                                                 ` Tuấn-Anh Nguyễn
  2021-09-11  4:11                                                                                                                                                                   ` Yuan Fu
  2021-09-11  5:51                                                                                                                                                                 ` Eli Zaretskii
  1 sibling, 1 reply; 284+ messages in thread
From: Tuấn-Anh Nguyễn @ 2021-09-11  3:41 UTC (permalink / raw)
  To: Yuan Fu
  Cc: Clément Pit-Claudel, Theodor Thornhill, emacs-devel,
	Stefan Monnier, Eli Zaretskii, Stephen Leake

> Just realized another problem, how do we make sure the loaded library is GPL-compatible?

This question is rather non-technical, so I can't provide any comments.

> There certainly won’t be “plugin_is_GPL_compatible” symbol in them… And IIUC Emacs cannot load GPL-incompatible dynamic libraries?

That's one of the reasons for using `dynlib.h` APIs directly. The check for
that symbol is at the level of `emacs-module.c`. Let's not conceptually
conflate a "shared library" and an "Emacs dynamic module".


-- 
Tuấn-Anh Nguyễn
Software Engineer



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-11  3:41                                                                                                                                                                 ` Tuấn-Anh Nguyễn
@ 2021-09-11  4:11                                                                                                                                                                   ` Yuan Fu
  2021-09-11  7:23                                                                                                                                                                     ` Tuấn-Anh Nguyễn
  0 siblings, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-09-11  4:11 UTC (permalink / raw)
  To: Tuấn-Anh Nguyễn
  Cc: Clément Pit-Claudel, Theodor Thornhill, emacs-devel,
	Stefan Monnier, Eli Zaretskii, Stephen Leake



> On Sep 10, 2021, at 8:41 PM, Tuấn-Anh Nguyễn <ubolonton@gmail.com> wrote:
> 
>> Just realized another problem, how do we make sure the loaded library is GPL-compatible?
> 
> This question is rather non-technical, so I can't provide any comments.
> 
>> There certainly won’t be “plugin_is_GPL_compatible” symbol in them… And IIUC Emacs cannot load GPL-incompatible dynamic libraries?
> 
> That's one of the reasons for using `dynlib.h` APIs directly. The check for
> that symbol is at the level of `emacs-module.c`. Let's not conceptually
> conflate a "shared library" and an "Emacs dynamic module”.

I think you have it backwards. IIUC the reason why every Emacs dynamic module declares “plugin_is_GPL_compatible” is that every shared library that links with Emacs must be GPL compatible, and an Emacs dynamic module is a shared library. But that’s just my understanding, of course. I’m happy to be corrected.

Yuan





^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-10 19:57                                                                                                                                                               ` Yuan Fu
  2021-09-11  3:41                                                                                                                                                                 ` Tuấn-Anh Nguyễn
@ 2021-09-11  5:51                                                                                                                                                                 ` Eli Zaretskii
  2021-09-11 19:00                                                                                                                                                                   ` Yuan Fu
  1 sibling, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-09-11  5:51 UTC (permalink / raw)
  To: Yuan Fu; +Cc: ubolonton, theo, cpitclaudel, emacs-devel, monnier, stephen_leake

> From: Yuan Fu <casouri@gmail.com>
> Date: Fri, 10 Sep 2021 12:57:22 -0700
> Cc: Tuấn-Anh Nguyễn <ubolonton@gmail.com>,
>  Theodor Thornhill <theo@thornhill.no>,
>  Stephen Leake <stephen_leake@stephe-leake.org>,
>  Clément Pit-Claudel <cpitclaudel@gmail.com>,
>  Stefan Monnier <monnier@iro.umontreal.ca>,
>  emacs-devel@gnu.org
> 
> > Sure.  I guess we will have to have a database of module names for
> > each programming language somewhere?
> 
> My plan is to translate lisp names to C names by default, and have an override list for irregular names that can’t be translated correctly.

What are "Lisp names" in this context?  Are you saying that the name
of a programming language, derived from the major mode, can be used to
produce the name of the shared library programmatically?  If so, how?

> Just realized another problem, how do we make sure the loaded library is GPL-compatible? There certainly won’t be “plugin_is_GPL_compatible” symbol in them… And IIUC Emacs cannot load GPL-incompatible dynamic libraries?

That's only needed for Emacs modules, not for external libraries that
provide some extra functionality on the level of primitives.  For
those, we just make sure their license is compatible with GPL.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-11  4:11                                                                                                                                                                   ` Yuan Fu
@ 2021-09-11  7:23                                                                                                                                                                     ` Tuấn-Anh Nguyễn
  2021-09-11 19:02                                                                                                                                                                       ` Yuan Fu
  0 siblings, 1 reply; 284+ messages in thread
From: Tuấn-Anh Nguyễn @ 2021-09-11  7:23 UTC (permalink / raw)
  To: Yuan Fu
  Cc: Clément Pit-Claudel, Theodor Thornhill, emacs-devel,
	Stefan Monnier, Eli Zaretskii, Stephen Leake

> I think you have it backwards. IIUC the reason why every Emacs dynamic module declares “plugin_is_GPL_compatible” is that every shared library that links with Emacs must be GPL compatible, and an Emacs dynamic module is a shared library. But that’s just my understanding, of course. I’m happy to be corrected.

That understanding is wrong. To help you understand better: every Emacs dynamic
module is a shared library, but the opposite is not true. If you are still
confused, read the relevant parts in `emacs-module.c`. On another note, shared
libraries in general don't "link" with Emacs. "Linking" has very specific and
precise technical meanings in this context. Please read up on that, starting
from "dynamic linking vs. dynamic loading."

-- 
Tuấn-Anh Nguyễn
Software Engineer



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-11  5:51                                                                                                                                                                 ` Eli Zaretskii
@ 2021-09-11 19:00                                                                                                                                                                   ` Yuan Fu
  2021-09-11 19:14                                                                                                                                                                     ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-09-11 19:00 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: Tuấn-Anh Nguyễn, theo, cpitclaudel, emacs-devel,
	monnier, stephen_leake



> On Sep 10, 2021, at 10:51 PM, Eli Zaretskii <eliz@gnu.org> wrote:
> 
>> From: Yuan Fu <casouri@gmail.com>
>> Date: Fri, 10 Sep 2021 12:57:22 -0700
>> Cc: Tuấn-Anh Nguyễn <ubolonton@gmail.com>,
>> Theodor Thornhill <theo@thornhill.no>,
>> Stephen Leake <stephen_leake@stephe-leake.org>,
>> Clément Pit-Claudel <cpitclaudel@gmail.com>,
>> Stefan Monnier <monnier@iro.umontreal.ca>,
>> emacs-devel@gnu.org
>> 
>>> Sure.  I guess we will have to have a database of module names for
>>> each programming language somewhere?
>> 
>> My plan is to translate lisp names to C names by default, and have an override list for irregular names that can’t be translated correctly.
> 
> What are "Lisp names" in this context?  Are you saying that the name
> of a programming language, derived from the major mode, can be used to
> produce the name of the shared library programmatically?  If so, how?

I don’t think it’s a rule, but language definitions are conventionally named tree-sitter-<lang>. E.g. tree-sitter-c, tree-sitter-json, tree-sitter-c-sharp. And the symbol they expose are tree_sitter_<lang>, e.g., tree_sitter_c, tree_sitter_jon, tree_sitter_c_sharp. Currently we use a symbol tree-sitter-<lang> to represent a language, so we can translate the symbol tree-sitter-<lang> to tree-sitter-<lang>.so/dylib/dll to get the shared library name, and to tree_sitter_<lang> to get the C symbol name.

BTW, since dynamic libraries has different extensions on different systems, what I want to do it to try loading the library with .so, then try .dylib, then try .dll, is that a good idea?

> 
>> Just realized another problem, how do we make sure the loaded library is GPL-compatible? There certainly won’t be “plugin_is_GPL_compatible” symbol in them… And IIUC Emacs cannot load GPL-incompatible dynamic libraries?
> 
> That's only needed for Emacs modules, not for external libraries that
> provide some extra functionality on the level of primitives.  For
> those, we just make sure their license is compatible with GPL.

Thanks, that’s all I need to know.

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-11  7:23                                                                                                                                                                     ` Tuấn-Anh Nguyễn
@ 2021-09-11 19:02                                                                                                                                                                       ` Yuan Fu
  0 siblings, 0 replies; 284+ messages in thread
From: Yuan Fu @ 2021-09-11 19:02 UTC (permalink / raw)
  To: Tuấn-Anh Nguyễn
  Cc: Clément Pit-Claudel, Theodor Thornhill, emacs-devel,
	Stefan Monnier, Eli Zaretskii, Stephen Leake



> On Sep 11, 2021, at 12:23 AM, Tuấn-Anh Nguyễn <ubolonton@gmail.com> wrote:
> 
>> I think you have it backwards. IIUC the reason why every Emacs dynamic module declares “plugin_is_GPL_compatible” is that every shared library that links with Emacs must be GPL compatible, and an Emacs dynamic module is a shared library. But that’s just my understanding, of course. I’m happy to be corrected.
> 
> That understanding is wrong. To help you understand better: every Emacs dynamic
> module is a shared library, but the opposite is not true. If you are still
> confused, read the relevant parts in `emacs-module.c`. On another note, shared
> libraries in general don't "link" with Emacs. "Linking" has very specific and
> precise technical meanings in this context. Please read up on that, starting
> from "dynamic linking vs. dynamic loading.”

I see, thanks for the explanation. Anyway, I’m glad there isn’t an issue. 

Yuan




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-11 19:00                                                                                                                                                                   ` Yuan Fu
@ 2021-09-11 19:14                                                                                                                                                                     ` Eli Zaretskii
  2021-09-11 19:17                                                                                                                                                                       ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-09-11 19:14 UTC (permalink / raw)
  To: Yuan Fu; +Cc: ubolonton, theo, cpitclaudel, emacs-devel, monnier, stephen_leake

> From: Yuan Fu <casouri@gmail.com>
> Date: Sat, 11 Sep 2021 12:00:59 -0700
> Cc: Tuấn-Anh Nguyễn <ubolonton@gmail.com>,
>  theo@thornhill.no,
>  stephen_leake@stephe-leake.org,
>  cpitclaudel@gmail.com,
>  monnier@iro.umontreal.ca,
>  emacs-devel@gnu.org
> 
> >> My plan is to translate lisp names to C names by default, and have an override list for irregular names that can’t be translated correctly.
> > 
> > What are "Lisp names" in this context?  Are you saying that the name
> > of a programming language, derived from the major mode, can be used to
> > produce the name of the shared library programmatically?  If so, how?
> 
> I don’t think it’s a rule, but language definitions are conventionally named tree-sitter-<lang>. E.g. tree-sitter-c, tree-sitter-json, tree-sitter-c-sharp. And the symbol they expose are tree_sitter_<lang>, e.g., tree_sitter_c, tree_sitter_jon, tree_sitter_c_sharp. Currently we use a symbol tree-sitter-<lang> to represent a language, so we can translate the symbol tree-sitter-<lang> to tree-sitter-<lang>.so/dylib/dll to get the shared library name, and to tree_sitter_<lang> to get the C symbol name.

But the <lang> part is still needed to be concocted somehow.  E.g.,
the conversion from "C#" to "c-sharp" isn't trivial.

> BTW, since dynamic libraries has different extensions on different systems, what I want to do it to try loading the library with .so, then try .dylib, then try .dll, is that a good idea?

We can do better, see load-suffixes.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-11 19:14                                                                                                                                                                     ` Eli Zaretskii
@ 2021-09-11 19:17                                                                                                                                                                       ` Eli Zaretskii
  2021-09-11 20:29                                                                                                                                                                         ` Yuan Fu
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-09-11 19:17 UTC (permalink / raw)
  To: casouri; +Cc: ubolonton, theo, cpitclaudel, emacs-devel, monnier, stephen_leake

> Date: Sat, 11 Sep 2021 22:14:26 +0300
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: ubolonton@gmail.com, theo@thornhill.no, cpitclaudel@gmail.com,
>  emacs-devel@gnu.org, monnier@iro.umontreal.ca, stephen_leake@stephe-leake.org
> 
> > BTW, since dynamic libraries has different extensions on different systems, what I want to do it to try loading the library with .so, then try .dylib, then try .dll, is that a good idea?
> 
> We can do better, see load-suffixes.

And in C, you can use MODULES_SUFFIX directly.  Though we will
probably need some minor changes there, to have the suffix defined
even in a build --without-modules.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-11 19:17                                                                                                                                                                       ` Eli Zaretskii
@ 2021-09-11 20:29                                                                                                                                                                         ` Yuan Fu
  2021-09-12  5:39                                                                                                                                                                           ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-09-11 20:29 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: Tuấn-Anh Nguyễn, Theodor Thornhill,
	Clément Pit-Claudel, Emacs developers, Stefan Monnier,
	stephen_leake


> On Sep 11, 2021, at 12:14 PM, Eli Zaretskii <eliz@gnu.org> wrote:
> 
> But the <lang> part is still needed to be concocted somehow.  E.g.,
> the conversion from "C#" to "c-sharp" isn't trivial.
> 

The project name of tree-sitter’s C# definition is “tree-sitter-c-sharp”[1]. So if someone wants to use the C# language, they probably know what symbol represents it (we will explain the translation rule in doc-string and the manual). I also want to point out that we don’t come up with the symbols representing each language, the _user_ passes 'tree-sitter-parser-create' a symbol representing a language, and we translate that symbol to dynamic library name and C symbol name.

[1]: https://github.com/tree-sitter/tree-sitter-c-sharp


> On Sep 11, 2021, at 12:17 PM, Eli Zaretskii <eliz@gnu.org> wrote:
> 
>> Date: Sat, 11 Sep 2021 22:14:26 +0300
>> From: Eli Zaretskii <eliz@gnu.org>
>> Cc: ubolonton@gmail.com, theo@thornhill.no, cpitclaudel@gmail.com,
>> emacs-devel@gnu.org, monnier@iro.umontreal.ca, stephen_leake@stephe-leake.org
>> 
>>> BTW, since dynamic libraries has different extensions on different systems, what I want to do it to try loading the library with .so, then try .dylib, then try .dll, is that a good idea?
>> 
>> We can do better, see load-suffixes.
> 
> And in C, you can use MODULES_SUFFIX directly.  Though we will
> probably need some minor changes there, to have the suffix defined
> even in a build --without-modules.

I’m using tree-sitter-load-suffixes with default value ‘(“.so”, “.dylib”, “.dll”). Should I populate this variable with MODULES_SUFFIX and MODULES_SECONDARY_SUFFIX, or should I just use the two SUFFIX in C? I.e., do you see a need for users to customize suffixes?

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-11 20:29                                                                                                                                                                         ` Yuan Fu
@ 2021-09-12  5:39                                                                                                                                                                           ` Eli Zaretskii
  2021-09-13  4:15                                                                                                                                                                             ` Yuan Fu
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-09-12  5:39 UTC (permalink / raw)
  To: Yuan Fu; +Cc: ubolonton, theo, cpitclaudel, emacs-devel, monnier, stephen_leake

> From: Yuan Fu <casouri@gmail.com>
> Date: Sat, 11 Sep 2021 13:29:09 -0700
> Cc: Tuấn-Anh Nguyễn <ubolonton@gmail.com>,
>  Theodor Thornhill <theo@thornhill.no>,
>  Clément Pit-Claudel <cpitclaudel@gmail.com>,
>  Emacs developers <emacs-devel@gnu.org>,
>  Stefan Monnier <monnier@iro.umontreal.ca>,
>  stephen_leake@stephe-leake.org
> 
> > But the <lang> part is still needed to be concocted somehow.  E.g.,
> > the conversion from "C#" to "c-sharp" isn't trivial.
> 
> The project name of tree-sitter’s C# definition is “tree-sitter-c-sharp”[1]. So if someone wants to use the C# language, they probably know what symbol represents it (we will explain the translation rule in doc-string and the manual). I also want to point out that we don’t come up with the symbols representing each language, the _user_ passes 'tree-sitter-parser-create' a symbol representing a language, and we translate that symbol to dynamic library name and C symbol name.

Surely, you don't mean "user" as in "the person who edits a source
file"?  I presume you mean the Lisp program, not the human user.  That
Lisp program is the major mode which wants to use TS services, and the
only thing that it has in hand is its own symbol, like 'c-mode' or
'python-mode' or 'f90-mode'.  It needs a way to pass the corresponding
TS module name to TS, and my question is: how would the major mode
compute the correct module name?  We need either a mode-specific
variable with that name, or some global function that could be used by
any major mode to obtain the language module name.

> >>> BTW, since dynamic libraries has different extensions on different systems, what I want to do it to try loading the library with .so, then try .dylib, then try .dll, is that a good idea?
> >> 
> >> We can do better, see load-suffixes.
> > 
> > And in C, you can use MODULES_SUFFIX directly.  Though we will
> > probably need some minor changes there, to have the suffix defined
> > even in a build --without-modules.
> 
> I’m using tree-sitter-load-suffixes with default value ‘(“.so”, “.dylib”, “.dll”). Should I populate this variable with MODULES_SUFFIX and MODULES_SECONDARY_SUFFIX, or should I just use the two SUFFIX in C? I.e., do you see a need for users to customize suffixes?

I'd prefer a general variable shared-library-suffix(es), either a
single value specific to the target system or an alist with keys being
system names (from system-type).  Then we could use that in
load-suffixes (instead of MODULES_SUFFIX) and everywhere else.



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-12  5:39                                                                                                                                                                           ` Eli Zaretskii
@ 2021-09-13  4:15                                                                                                                                                                             ` Yuan Fu
  2021-09-13 11:47                                                                                                                                                                               ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-09-13  4:15 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: Tuấn-Anh Nguyễn, Theodor Thornhill,
	Clément Pit-Claudel, Emacs developers, Stefan Monnier,
	stephen_leake



> On Sep 11, 2021, at 10:39 PM, Eli Zaretskii <eliz@gnu.org> wrote:
> 
>> From: Yuan Fu <casouri@gmail.com>
>> Date: Sat, 11 Sep 2021 13:29:09 -0700
>> Cc: Tuấn-Anh Nguyễn <ubolonton@gmail.com>,
>> Theodor Thornhill <theo@thornhill.no>,
>> Clément Pit-Claudel <cpitclaudel@gmail.com>,
>> Emacs developers <emacs-devel@gnu.org>,
>> Stefan Monnier <monnier@iro.umontreal.ca>,
>> stephen_leake@stephe-leake.org
>> 
>>> But the <lang> part is still needed to be concocted somehow.  E.g.,
>>> the conversion from "C#" to "c-sharp" isn't trivial.
>> 
>> The project name of tree-sitter’s C# definition is “tree-sitter-c-sharp”[1]. So if someone wants to use the C# language, they probably know what symbol represents it (we will explain the translation rule in doc-string and the manual). I also want to point out that we don’t come up with the symbols representing each language, the _user_ passes 'tree-sitter-parser-create' a symbol representing a language, and we translate that symbol to dynamic library name and C symbol name.
> 
> Surely, you don't mean "user" as in "the person who edits a source
> file"?  I presume you mean the Lisp program, not the human user.  That
> Lisp program is the major mode which wants to use TS services, and the
> only thing that it has in hand is its own symbol, like 'c-mode' or
> 'python-mode' or 'f90-mode'.  It needs a way to pass the corresponding
> TS module name to TS, and my question is: how would the major mode
> compute the correct module name?  We need either a mode-specific
> variable with that name, or some global function that could be used by
> any major mode to obtain the language module name.

Not the end-user, no. But not really “Lisp Program”, either. I mean the human being writing the major-mode and adapting the major-mode to utilize tree-sitter features. The major mode writer should be able to figure out the correct symbol to use, if she go checks out the project name for the language definition, or the package name of the language definition in her package manager, or by some other means. For example, one should be able to figure out that tree-sitter-c is the symbol for C language definition, and tree-sitter-c-sharp that C#. Then Emacs automatically translate tree-sitter-c to libtree-sitter-c.so, and tree-sitter-c-sharp to libtree-sitter-c-sharp.so; basically adding “lib” and “.so” (or “dylib” etc). If that doesn’t give the correct library name for a quirky language, the major-mode writer can add an entry to tree-sitter-library-name-override-list—(tree-sitter-quirky-lang “libtree-sitter-qlang” “tree_sitter_qlang”)—and Emacs will use that. (Or she can just use tree-sitter-qlang as the symbol, and Emacs’ auto translation would just fine.)

> 
>>>>> BTW, since dynamic libraries has different extensions on different systems, what I want to do it to try loading the library with .so, then try .dylib, then try .dll, is that a good idea?
>>>> 
>>>> We can do better, see load-suffixes.
>>> 
>>> And in C, you can use MODULES_SUFFIX directly.  Though we will
>>> probably need some minor changes there, to have the suffix defined
>>> even in a build --without-modules.
>> 
>> I’m using tree-sitter-load-suffixes with default value ‘(“.so”, “.dylib”, “.dll”). Should I populate this variable with MODULES_SUFFIX and MODULES_SECONDARY_SUFFIX, or should I just use the two SUFFIX in C? I.e., do you see a need for users to customize suffixes?
> 
> I'd prefer a general variable shared-library-suffix(es), either a
> single value specific to the target system or an alist with keys being
> system names (from system-type).  Then we could use that in
> load-suffixes (instead of MODULES_SUFFIX) and everywhere else.

To summarize, we have 

	"load-suffixes” (".elc" ".el”, with M_SUFFIX & M_SEC_SUFFIX if modules enabled), 
	"module-file-suffix” (M_SUFFIX if modules enabled), 
	"load-file-rep-suffixes” ("" ".gz"). 

All contribute to the possible file names Emacs tries when loading a file (be it a Elisp file or an Emacs module). I will add a "shared-library-suffix” specifically for loading dynamic libraries, its value will be MODULES_SUFFIX regardless if module is enabled.

Yuan





^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-13  4:15                                                                                                                                                                             ` Yuan Fu
@ 2021-09-13 11:47                                                                                                                                                                               ` Eli Zaretskii
  2021-09-13 18:01                                                                                                                                                                                 ` Yuan Fu
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-09-13 11:47 UTC (permalink / raw)
  To: Yuan Fu; +Cc: ubolonton, theo, cpitclaudel, emacs-devel, monnier, stephen_leake

> From: Yuan Fu <casouri@gmail.com>
> Date: Sun, 12 Sep 2021 21:15:31 -0700
> Cc: Tuấn-Anh Nguyễn <ubolonton@gmail.com>,
>  Theodor Thornhill <theo@thornhill.no>,
>  Clément Pit-Claudel <cpitclaudel@gmail.com>,
>  Emacs developers <emacs-devel@gnu.org>,
>  Stefan Monnier <monnier@iro.umontreal.ca>,
>  stephen_leake@stephe-leake.org
> 
> Not the end-user, no. But not really “Lisp Program”, either. I mean the human being writing the major-mode and adapting the major-mode to utilize tree-sitter features. The major mode writer should be able to figure out the correct symbol to use, if she go checks out the project name for the language definition, or the package name of the language definition in her package manager, or by some other means. For example, one should be able to figure out that tree-sitter-c is the symbol for C language definition, and tree-sitter-c-sharp that C#. Then Emacs automatically translate tree-sitter-c to libtree-sitter-c.so, and tree-sitter-c-sharp to libtree-sitter-c-sharp.so; basically adding “lib” and “.so” (or “dylib” etc). If that doesn’t give the correct library name for a qui
 rky language, the major-mode writer can add an entry to tree-sitter-library-name-override-list—(tree-sitter-quirky-lang “libtree-sitter-qlang” “tree_sitter_qlang”)—and Emacs will use that. (Or she can just use tree-sitter-qlang as the symbol, and Emacs’ auto translation would just fine.)

It makes little sense to me to request each major mode to figure this
out.  It should IMO be a service provided by the TS integration into
Emacs.

> To summarize, we have 
> 
> 	"load-suffixes” (".elc" ".el”, with M_SUFFIX & M_SEC_SUFFIX if modules enabled), 
> 	"module-file-suffix” (M_SUFFIX if modules enabled), 
> 	"load-file-rep-suffixes” ("" ".gz"). 
> 
> All contribute to the possible file names Emacs tries when loading a file (be it a Elisp file or an Emacs module). I will add a "shared-library-suffix” specifically for loading dynamic libraries, its value will be MODULES_SUFFIX regardless if module is enabled.

Maybe the other way around: define a shared-library-suffix, and make
MODULES_SUFFIX use that if Emacs is built with modules.

Otherwise, SGTM, thanks.




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-13 11:47                                                                                                                                                                               ` Eli Zaretskii
@ 2021-09-13 18:01                                                                                                                                                                                 ` Yuan Fu
  2021-09-13 18:07                                                                                                                                                                                   ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-09-13 18:01 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: Tuấn-Anh Nguyễn, Theodor Thornhill,
	Clément Pit-Claudel, Emacs developers, Stefan Monnier,
	stephen_leake



> On Sep 13, 2021, at 4:47 AM, Eli Zaretskii <eliz@gnu.org> wrote:
> 
>> From: Yuan Fu <casouri@gmail.com>
>> Date: Sun, 12 Sep 2021 21:15:31 -0700
>> Cc: Tuấn-Anh Nguyễn <ubolonton@gmail.com>,
>> Theodor Thornhill <theo@thornhill.no>,
>> Clément Pit-Claudel <cpitclaudel@gmail.com>,
>> Emacs developers <emacs-devel@gnu.org>,
>> Stefan Monnier <monnier@iro.umontreal.ca>,
>> stephen_leake@stephe-leake.org
>> 
>> Not the end-user, no. But not really “Lisp Program”, either. I mean the human being writing the major-mode and adapting the major-mode to utilize tree-sitter features. The major mode writer should be able to figure out the correct symbol to use, if she go checks out the project name for the language definition, or the package name of the language definition in her package manager, or by some other means. For example, one should be able to figure out that tree-sitter-c is the symbol for C language definition, and tree-sitter-c-sharp that C#. Then Emacs automatically translate tree-sitter-c to libtree-sitter-c.so, and tree-sitter-c-sharp to libtree-sitter-c-sharp.so; basically adding “lib” and “.so” (or “dylib” etc). If that doesn’t give the correct library name for a quirky language, the major-mode writer can add an entry to tree-sitter-library-name-override-list—(tree-sitter-quirky-lang “libtree-sitter-qlang” “tree_sitter_qlang”)—and Emacs will use that. (Or she can just use tree-sitter-qlang as the symbol, and Emacs’ auto translation would just fine.)
> 
> It makes little sense to me to request each major mode to figure this
> out.  It should IMO be a service provided by the TS integration into
> Emacs.

This is IMO the easiest and least confusing way for major-mode authors. But before we continue, what is the way you envisioned? I’m not sure what exactly is the service you want Emacs to provide.

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-13 18:01                                                                                                                                                                                 ` Yuan Fu
@ 2021-09-13 18:07                                                                                                                                                                                   ` Eli Zaretskii
  2021-09-13 18:29                                                                                                                                                                                     ` Yuan Fu
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-09-13 18:07 UTC (permalink / raw)
  To: Yuan Fu; +Cc: ubolonton, theo, cpitclaudel, emacs-devel, monnier, stephen_leake

> From: Yuan Fu <casouri@gmail.com>
> Date: Mon, 13 Sep 2021 11:01:47 -0700
> Cc: Tuấn-Anh Nguyễn <ubolonton@gmail.com>,
>  Theodor Thornhill <theo@thornhill.no>,
>  Clément Pit-Claudel <cpitclaudel@gmail.com>,
>  Emacs developers <emacs-devel@gnu.org>,
>  Stefan Monnier <monnier@iro.umontreal.ca>,
>  stephen_leake@stephe-leake.org
> 
> > It makes little sense to me to request each major mode to figure this
> > out.  It should IMO be a service provided by the TS integration into
> > Emacs.
> 
> This is IMO the easiest and least confusing way for major-mode authors. But before we continue, what is the way you envisioned? I’m not sure what exactly is the service you want Emacs to provide.

What I had in mind is a function that give the major-mode symbol will
return the name of the corresponding TS language module (or a list of
modules, if there's more than one).



^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-13 18:07                                                                                                                                                                                   ` Eli Zaretskii
@ 2021-09-13 18:29                                                                                                                                                                                     ` Yuan Fu
  2021-09-13 18:37                                                                                                                                                                                       ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-09-13 18:29 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: Tuấn-Anh Nguyễn, Theodor Thornhill,
	Clément Pit-Claudel, Emacs developers, Stefan Monnier,
	stephen_leake



> On Sep 13, 2021, at 11:07 AM, Eli Zaretskii <eliz@gnu.org> wrote:
> 
>> From: Yuan Fu <casouri@gmail.com>
>> Date: Mon, 13 Sep 2021 11:01:47 -0700
>> Cc: Tuấn-Anh Nguyễn <ubolonton@gmail.com>,
>> Theodor Thornhill <theo@thornhill.no>,
>> Clément Pit-Claudel <cpitclaudel@gmail.com>,
>> Emacs developers <emacs-devel@gnu.org>,
>> Stefan Monnier <monnier@iro.umontreal.ca>,
>> stephen_leake@stephe-leake.org
>> 
>>> It makes little sense to me to request each major mode to figure this
>>> out.  It should IMO be a service provided by the TS integration into
>>> Emacs.
>> 
>> This is IMO the easiest and least confusing way for major-mode authors. But before we continue, what is the way you envisioned? I’m not sure what exactly is the service you want Emacs to provide.
> 
> What I had in mind is a function that give the major-mode symbol will
> return the name of the corresponding TS language module (or a list of
> modules, if there's more than one).

My problem with such a function is that Emacs can’t possibly cover all the major-modes and tree-sitter languages. What if there is a new language, and someone wrote a tree-sitter language definition for it, and then want to write an Emacs major mode using tree-sitter features?

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-13 18:29                                                                                                                                                                                     ` Yuan Fu
@ 2021-09-13 18:37                                                                                                                                                                                       ` Eli Zaretskii
  2021-09-14  0:13                                                                                                                                                                                         ` Yuan Fu
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-09-13 18:37 UTC (permalink / raw)
  To: Yuan Fu; +Cc: ubolonton, theo, cpitclaudel, emacs-devel, monnier, stephen_leake

> From: Yuan Fu <casouri@gmail.com>
> Date: Mon, 13 Sep 2021 11:29:01 -0700
> Cc: Tuấn-Anh Nguyễn <ubolonton@gmail.com>,
>  Theodor Thornhill <theo@thornhill.no>,
>  Clément Pit-Claudel <cpitclaudel@gmail.com>,
>  Emacs developers <emacs-devel@gnu.org>,
>  Stefan Monnier <monnier@iro.umontreal.ca>,
>  stephen_leake@stephe-leake.org
> 
> > What I had in mind is a function that give the major-mode symbol will
> > return the name of the corresponding TS language module (or a list of
> > modules, if there's more than one).
> 
> My problem with such a function is that Emacs can’t possibly cover all the major-modes and tree-sitter languages. What if there is a new language, and someone wrote a tree-sitter language definition for it, and then want to write an Emacs major mode using tree-sitter features?

A new major mode will extend the function to support its language(s).
the extension could be as simple as adding something to a database of
known mode-to-language associations in some alist.




^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-13 18:37                                                                                                                                                                                       ` Eli Zaretskii
@ 2021-09-14  0:13                                                                                                                                                                                         ` Yuan Fu
  2021-09-14  2:29                                                                                                                                                                                           ` Eli Zaretskii
  0 siblings, 1 reply; 284+ messages in thread
From: Yuan Fu @ 2021-09-14  0:13 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: Tuấn-Anh Nguyễn, Theodor Thornhill,
	Clément Pit-Claudel, Emacs developers, Stefan Monnier,
	stephen_leake

>> 
>>> What I had in mind is a function that give the major-mode symbol will
>>> return the name of the corresponding TS language module (or a list of
>>> modules, if there's more than one).
>> 
>> My problem with such a function is that Emacs can’t possibly cover all the major-modes and tree-sitter languages. What if there is a new language, and someone wrote a tree-sitter language definition for it, and then want to write an Emacs major mode using tree-sitter features?
> 
> A new major mode will extend the function to support its language(s).
> the extension could be as simple as adding something to a database of
> known mode-to-language associations in some alist.
> 

Just to recap, we were talking about how to represent a tree-sitter language in Emacs and how to figure out the dynamic library name for that language. My plan is to use tree-sitter-<lang> to represent a language, which is usually the project name for that language definition. And we just turn it into libtree-sitter-<lang>.so/dylib/dll to get the name of the dynamic library. I think your idea has evolved into another thing—translating major-mode to the tree-sitter languages it uses could be useful, but how does it help with the original topic (representing language, translate to library name)?

Yuan


^ permalink raw reply	[flat|nested] 284+ messages in thread

* Re: Tree-sitter api
  2021-09-14  0:13                                                                                                                                                                                         ` Yuan Fu
@ 2021-09-14  2:29                                                                                                                                                                                           ` Eli Zaretskii
  2021-09-14  4:27                                                                                                                                                                                             ` Yuan Fu
  0 siblings, 1 reply; 284+ messages in thread
From: Eli Zaretskii @ 2021-09-14  2:29 UTC (permalink / raw)
  To: Yuan Fu; +Cc: ubolonton, theo, cpitclaudel, emacs-devel, monnier, stephen_leake

> From: Yuan Fu <casouri@gmail.com>
> Date: Mon, 13 Sep 2021 17:13:40 -0700
> Cc: Tuấn-Anh Nguyễn <ubolonton@gmail.com>,
>  Theodor Thornhill <theo@thornhill.no>,
>  Clément Pit-Claudel