bug#61369: Problem with keeping tree-sitter parse tree up-to-date

all messages for Emacs-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed

* bug#61369: Problem with keeping tree-sitter parse tree up-to-date
@ 2023-02-08 15:34 Dmitry Gutov
  2023-02-08 18:20 ` Theodor Thornhill via Bug reports for GNU Emacs, the Swiss army knife of text editors
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Dmitry Gutov @ 2023-02-08 15:34 UTC (permalink / raw)
  To: 61369

1. Have some buffer with text

"use std::Path::{self, Path, PathBuf};  // good: std is a crate name
... (maybe something else
"

2. Have this text in a different buffer (I used scratch):

"


let date = DateTime::<chrono::Utc>::from_utc(date, chrono::Utc);
"

Meaning, the buffer starts with two empty lines.

3. Select the first line from the first buffer including the trailing 
newline, yank and then paste at the beginning of the second buffer.

The second buffer will now look like this:

"use std::Path::{self, Path, PathBuf};  // good: std is a crate name


let date = DateTime::<chrono::Utc>::from_utc(date, chrono::Utc);
"

The parse tree will, however, be (according to treesit-explore-mode):

(source_file
  (use_declaration use
   argument:
    (scoped_use_list
     path: (scoped_identifier path: (identifier) :: name: (identifier))
     ::
     list: (use_list { (self) , (identifier) , (identifier) }))
   ;)
  (line_comment)
  (let_declaration let pattern: (identifier) =
   value:
    (scoped_identifier
     path:
      (generic_type type: (type_identifier) ::
       type_arguments:
        (type_arguments <
         (scoped_type_identifier path: (identifier) :: name: 
(type_identifier))
         >))
     :: name: (identifier))
   (ERROR ( (identifier) ,
    (scoped_identifier path: (identifier) :: name: (identifier))
    ;)
   ;))

But if I edit the buffer after that, e.g. add and remove a space at the 
beginning, the error node disappears:

(source_file
  (use_declaration use
   argument:
    (scoped_use_list
     path: (scoped_identifier path: (identifier) :: name: (identifier))
     ::
     list: (use_list { (self) , (identifier) , (identifier) }))
   ;)
  (line_comment)
  (let_declaration let pattern: (identifier) =
   value:
    (call_expression
     function:
      (scoped_identifier
       path:
        (generic_type type: (type_identifier) ::
         type_arguments:
          (type_arguments <
           (scoped_type_identifier path: (identifier) :: name: 
(type_identifier))
           >))
       :: name: (identifier))
     arguments:
      (arguments ( (identifier) ,
       (scoped_identifier path: (identifier) :: name: (identifier))
       )))
   ;))





^ permalink raw reply	[flat|nested] 16+ messages in thread

* bug#61369: Problem with keeping tree-sitter parse tree up-to-date
  2023-02-08 15:34 bug#61369: Problem with keeping tree-sitter parse tree up-to-date Dmitry Gutov
@ 2023-02-08 18:20 ` Theodor Thornhill via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2023-02-08 19:41   ` Dmitry Gutov
  2023-02-10  1:22 ` Yuan Fu
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 16+ messages in thread
From: Theodor Thornhill via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-02-08 18:20 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: 61369

Dmitry Gutov <dgutov@yandex.ru> writes:

> 1. Have some buffer with text
>
> "use std::Path::{self, Path, PathBuf};  // good: std is a crate name
> ... (maybe something else
> "
>
> 2. Have this text in a different buffer (I used scratch):
>
> "
>
>
> let date = DateTime::<chrono::Utc>::from_utc(date, chrono::Utc);
> "
>
> Meaning, the buffer starts with two empty lines.
>
> 3. Select the first line from the first buffer including the trailing newline,
> yank and then paste at the beginning of the second buffer.
>
> The second buffer will now look like this:
>
> "use std::Path::{self, Path, PathBuf};  // good: std is a crate name
>
>
> let date = DateTime::<chrono::Utc>::from_utc(date, chrono::Utc);
> "

I just want to confirm that I can reproduce this, and that if you skip
the trailing newline from the use-statement, I don't get this behavior.
So it seems like the newline is the crucial point, right?

Theo





^ permalink raw reply	[flat|nested] 16+ messages in thread

* bug#61369: Problem with keeping tree-sitter parse tree up-to-date
  2023-02-08 18:20 ` Theodor Thornhill via Bug reports for GNU Emacs, the Swiss army knife of text editors
@ 2023-02-08 19:41   ` Dmitry Gutov
  0 siblings, 0 replies; 16+ messages in thread
From: Dmitry Gutov @ 2023-02-08 19:41 UTC (permalink / raw)
  To: Theodor Thornhill; +Cc: 61369@debbugs.gnu.org

[-- Attachment #1: Type: text/html, Size: 1628 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* bug#61369: Problem with keeping tree-sitter parse tree up-to-date
  2023-02-08 15:34 bug#61369: Problem with keeping tree-sitter parse tree up-to-date Dmitry Gutov
  2023-02-08 18:20 ` Theodor Thornhill via Bug reports for GNU Emacs, the Swiss army knife of text editors
@ 2023-02-10  1:22 ` Yuan Fu
  2023-02-10  1:38   ` Dmitry Gutov
  2023-02-13  9:10 ` Yuan Fu
  2023-02-13 23:59 ` Yuan Fu
  3 siblings, 1 reply; 16+ messages in thread
From: Yuan Fu @ 2023-02-10  1:22 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: theo, 61369


>  I just want to confirm that I can reproduce this, and that if you skip
>  the trailing newline from the use-statement, I don't get this behavior.
>  So it seems like the newline is the crucial point, right?
>
> Yes, same.
>
> Thr trailing newline is necessary.
>
> The empty lines at the beginning of the buffer (being copied to) are necessary to reproduce this as well.

Hmmm, it might be related to how does tree-sitter does incremental
parsing? If the newline is necessary, then I guess it’s not because
Emacs missed characters when reporting edits to tree-sitter.

Yuan





^ permalink raw reply	[flat|nested] 16+ messages in thread

* bug#61369: Problem with keeping tree-sitter parse tree up-to-date
  2023-02-10  1:22 ` Yuan Fu
@ 2023-02-10  1:38   ` Dmitry Gutov
  0 siblings, 0 replies; 16+ messages in thread
From: Dmitry Gutov @ 2023-02-10  1:38 UTC (permalink / raw)
  To: Yuan Fu; +Cc: theo, 61369

On 10/02/2023 03:22, Yuan Fu wrote:
>>   I just want to confirm that I can reproduce this, and that if you skip
>>   the trailing newline from the use-statement, I don't get this behavior.
>>   So it seems like the newline is the crucial point, right?
>>
>> Yes, same.
>>
>> Thr trailing newline is necessary.
>>
>> The empty lines at the beginning of the buffer (being copied to) are necessary to reproduce this as well.
> Hmmm, it might be related to how does tree-sitter does incremental
> parsing? If the newline is necessary, then I guess it’s not because
> Emacs missed characters when reporting edits to tree-sitter.

The newline is somewhat necessary: the scenario doesn't work, for 
example, if the pasted text doesn't include the newline but the buffer 
had an additional (third) one at the top.

But the scenario also doesn't work if some other (any) character is 
removed from the yanked line before pasting: it could be even one after 
the comment instruction (//).

OTOH, if I add an extra char to the yanked line, anywhere, I can skip 
the newline. E.g. I can paste

   use std::path::{self, Path, PathBuf};  // good: std is a crate namee

without a newline and still see the exact same syntax error.

So it looks more like an off-by-one error somewhere. Maybe in our code, 
but maybe in tree-sitter somewhere.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* bug#61369: Problem with keeping tree-sitter parse tree up-to-date
  2023-02-08 15:34 bug#61369: Problem with keeping tree-sitter parse tree up-to-date Dmitry Gutov
  2023-02-08 18:20 ` Theodor Thornhill via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2023-02-10  1:22 ` Yuan Fu
@ 2023-02-13  9:10 ` Yuan Fu
  2023-02-13 23:59 ` Yuan Fu
  3 siblings, 0 replies; 16+ messages in thread
From: Yuan Fu @ 2023-02-13  9:10 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: theo, 61369


Dmitry Gutov <dgutov@yandex.ru> writes:

> On 10/02/2023 03:22, Yuan Fu wrote:
>>>   I just want to confirm that I can reproduce this, and that if you skip
>>>   the trailing newline from the use-statement, I don't get this behavior.
>>>   So it seems like the newline is the crucial point, right?
>>>
>>> Yes, same.
>>>
>>> Thr trailing newline is necessary.
>>>
>>> The empty lines at the beginning of the buffer (being copied to) are necessary to reproduce this as well.
>> Hmmm, it might be related to how does tree-sitter does incremental
>> parsing? If the newline is necessary, then I guess it’s not because
>> Emacs missed characters when reporting edits to tree-sitter.
>
> The newline is somewhat necessary: the scenario doesn't work, for
> example, if the pasted text doesn't include the newline but the buffer
> had an additional (third) one at the top.
>
> But the scenario also doesn't work if some other (any) character is
> removed from the yanked line before pasting: it could be even one
> after the comment instruction (//).
>
> OTOH, if I add an extra char to the yanked line, anywhere, I can skip
> the newline. E.g. I can paste
>
>   use std::path::{self, Path, PathBuf};  // good: std is a crate namee
>
> without a newline and still see the exact same syntax error.
>
> So it looks more like an off-by-one error somewhere. Maybe in our
> code, but maybe in tree-sitter somewhere.

Some progress report: I added a function that reads the buffer like a
parser would, like this:

DEFUN ("treesit--parser-view",
       Ftreesit__parser_view,
       Streesit__parser_view, 1, 1, 0,
       doc: /* Return the view of PARSER.
Read buffer like PARSER would into a string and return it.  */)
  (Lisp_Object parser)
{
  const ptrdiff_t visible_beg = XTS_PARSER (parser)->visible_beg;
  const ptrdiff_t visible_end = XTS_PARSER (parser)->visible_end;
  const ptrdiff_t view_len = visible_end - visible_beg;

  char *str_buf = xzalloc (view_len + 1);
  uint32_t read = 0;
  TSPoint pos = { 0 };
  for (int idx = 0; idx < view_len; idx++)
    {
      const char *ch = treesit_read_buffer (XTS_PARSER (parser),
					    idx, pos, &read);
      if (read == 0)
	{
	  xfree (str_buf);
	  xsignal1 (Qtreesit_error, make_fixnum (idx));
	}
      else
	str_buf[idx] = *ch;
    }
  Lisp_Object ret_str = make_string (str_buf, view_len);
  xfree (str_buf);
  return ret_str;
}

After I follow the steps and got the error node, I run this function on
the parser, and the returned string looks good.

Next I’ll try to log every character actually read by the parser and see
if anything seems fishy.

Yuan





^ permalink raw reply	[flat|nested] 16+ messages in thread

* bug#61369: Problem with keeping tree-sitter parse tree up-to-date
  2023-02-08 15:34 bug#61369: Problem with keeping tree-sitter parse tree up-to-date Dmitry Gutov
                   ` (2 preceding siblings ...)
  2023-02-13  9:10 ` Yuan Fu
@ 2023-02-13 23:59 ` Yuan Fu
  2023-02-15  2:17   ` Dmitry Gutov
  3 siblings, 1 reply; 16+ messages in thread
From: Yuan Fu @ 2023-02-13 23:59 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: theo, 61369

[-- Attachment #1: Type: text/plain, Size: 5401 bytes --]


Yuan Fu <casouri@gmail.com> writes:

> Dmitry Gutov <dgutov@yandex.ru> writes:
>
>> On 10/02/2023 03:22, Yuan Fu wrote:
>>>>   I just want to confirm that I can reproduce this, and that if you skip
>>>>   the trailing newline from the use-statement, I don't get this behavior.
>>>>   So it seems like the newline is the crucial point, right?
>>>>
>>>> Yes, same.
>>>>
>>>> Thr trailing newline is necessary.
>>>>
>>>> The empty lines at the beginning of the buffer (being copied to) are necessary to reproduce this as well.
>>> Hmmm, it might be related to how does tree-sitter does incremental
>>> parsing? If the newline is necessary, then I guess it’s not because
>>> Emacs missed characters when reporting edits to tree-sitter.
>>
>> The newline is somewhat necessary: the scenario doesn't work, for
>> example, if the pasted text doesn't include the newline but the buffer
>> had an additional (third) one at the top.
>>
>> But the scenario also doesn't work if some other (any) character is
>> removed from the yanked line before pasting: it could be even one
>> after the comment instruction (//).
>>
>> OTOH, if I add an extra char to the yanked line, anywhere, I can skip
>> the newline. E.g. I can paste
>>
>>   use std::path::{self, Path, PathBuf};  // good: std is a crate namee
>>
>> without a newline and still see the exact same syntax error.
>>
>> So it looks more like an off-by-one error somewhere. Maybe in our
>> code, but maybe in tree-sitter somewhere.
>
> Some progress report: I added a function that reads the buffer like a
> parser would, like this:
>
> DEFUN ("treesit--parser-view",
>        Ftreesit__parser_view,
>        Streesit__parser_view, 1, 1, 0,
>        doc: /* Return the view of PARSER.
> Read buffer like PARSER would into a string and return it.  */)
>   (Lisp_Object parser)
> {
>   const ptrdiff_t visible_beg = XTS_PARSER (parser)->visible_beg;
>   const ptrdiff_t visible_end = XTS_PARSER (parser)->visible_end;
>   const ptrdiff_t view_len = visible_end - visible_beg;
>
>   char *str_buf = xzalloc (view_len + 1);
>   uint32_t read = 0;
>   TSPoint pos = { 0 };
>   for (int idx = 0; idx < view_len; idx++)
>     {
>       const char *ch = treesit_read_buffer (XTS_PARSER (parser),
> 					    idx, pos, &read);
>       if (read == 0)
> 	{
> 	  xfree (str_buf);
> 	  xsignal1 (Qtreesit_error, make_fixnum (idx));
> 	}
>       else
> 	str_buf[idx] = *ch;
>     }
>   Lisp_Object ret_str = make_string (str_buf, view_len);
>   xfree (str_buf);
>   return ret_str;
> }
>
> After I follow the steps and got the error node, I run this function on
> the parser, and the returned string looks good.
>
> Next I’ll try to log every character actually read by the parser and see
> if anything seems fishy.

I don’t know if it’s good news or bad news, but it doesn’t seem like a
off-by-one. Here is what I did:

1. I applied the attached patch (patch.diff) so that treesit_read_buffer, the
function used by tree-sitter parser to read buffer contents, prints the
position it read and the character it gets to stdout.

2. I open test.rs which contains

"

let date = DateTime::<chrono::Utc>::from_utc(date, chrono::Utc);
"

as in the recipe. I have rust-ts-mode enabled, so Emacs prints the
characters read by the parser to stdout. I type return several times to
separate this first batch of output from the next, which is what I’m
interested in.

3. I paste

"use std::Path::{self, Path, PathBuf};  // good: std is a crate name
"

at the beginning of the buffer. Now the parse tree contains that error
node. I go to the terminal, copy the output out, which looks like:

0 117
1 115
2 101
3 32
0 117
1 115
2 101
...
133 59
134 10
134 10
134 10
134 10

4. I paste this output (output.txt) into a buffer, and reconstruct the text read by
the parser with (setq str (reconstruct)), where reconstruct is:

(defun reconstruct ()
  (goto-char (point-min))
  (let ((result ""))
    (while (< (point) (point-max))
      (let* ((str (buffer-substring (point) (line-end-position)))
             (nums (string-split str))
             (pos (string-to-number (car nums)))
             (char (string-to-number (cadr nums))))
        (when (not (< pos (length result)))
          (setq result (concat result
                               (make-string (- (1+ pos) (length result))
                                            ?0))))
        (setf (aref result pos) char))
      (forward-line 1))
    result))

5. I insert str into a new buffer, and (to my disappointment) the
content is identical to the buffer text.

There are two surprises here: 1) there isn’t an off-by-one bug, 2) the
parser actually read the whole buffer, rather than reading only the new
content. Then there are even less reason for it to create that error
node.

In addition, I inserted a new line in the Rust source buffer (test.rs) (which
fixes the error node), here is what the parser read after that
insertion:

"0000000000000000000000000000000000000000000000000000000000000000000



let 0000 = 000000000000000000000000000000000000000000000000000);"

0 means it didn’t read that position, we can see that the parser read
all the newlines, "let ", " = ", and ");". I can’t discern anything
interesting from that, tho.

Yuan


[-- Attachment #2: output.txt --]
[-- Type: text/plain, Size: 1375 bytes --]

0 117
1 115
2 101
3 32
0 117
1 115
2 101
3 32
4 115
3 32
4 115
5 116
6 100
7 58
4 115
5 116
6 100
7 58
8 58
9 80
10 97
11 116
12 104
13 58
9 80
13 58
14 58
15 123
16 115
17 101
18 108
19 102
20 44
16 115
17 101
18 108
19 102
20 44
21 32
22 80
21 32
22 80
23 97
24 116
25 104
26 44
22 80
26 44
27 32
28 80
27 32
28 80
29 97
30 116
31 104
32 66
33 117
34 102
35 125
28 80
35 125
36 59
37 32
38 32
39 47
40 47
37 32
38 32
39 47
40 47
41 32
42 103
43 111
44 111
45 100
46 58
47 32
48 115
49 116
50 100
51 32
52 105
53 115
54 32
55 97
56 32
57 99
58 114
59 97
60 116
61 101
62 32
63 110
64 97
65 109
66 101
67 10
68 10
69 10
70 108
67 10
68 10
69 10
70 108
71 101
72 116
73 32
70 108
71 101
72 116
73 32
74 100
73 32
74 100
75 97
76 116
77 101
78 32
74 100
75 97
78 32
79 61
78 32
79 61
80 32
81 68
80 32
81 68
82 97
83 116
84 101
85 84
86 105
87 109
88 101
89 58
81 68
89 58
90 58
91 60
92 99
93 104
94 114
95 111
96 110
97 111
98 58
92 99
93 104
94 114
98 58
99 58
100 85
101 116
102 99
103 62
100 85
103 62
104 58
105 58
106 102
107 114
108 111
109 109
110 95
111 117
112 116
113 99
114 40
106 102
107 114
114 40
115 100
116 97
117 116
118 101
119 44
115 100
116 97
119 44
120 32
121 99
120 32
121 99
122 104
123 114
124 111
125 110
126 111
127 58
121 99
122 104
123 114
127 58
128 58
129 85
130 116
131 99
132 41
129 85
133 59
134 10
132 41
133 59
134 10
134 10
134 10
134 10

[-- Attachment #3: patch.diff --]
[-- Type: application/octet-stream, Size: 1767 bytes --]

diff --git a/src/treesit.c b/src/treesit.c
index cab2f0d5354..ad87a6ae759 100644
--- a/src/treesit.c
+++ b/src/treesit.c
@@ -1101,6 +1101,13 @@ treesit_read_buffer (void *parser, uint32_t byte_index,
      assertion should never hit.  */
   eassert (len < UINT32_MAX);
   *bytes_read = (uint32_t) len;
+
+  if (*bytes_read > 0)
+    {
+      printf ("%d %d\n", byte_index, *beg);
+      fflush (stdout);
+    }
+
   return beg;
 }
 
@@ -3432,6 +3439,37 @@ DEFUN ("treesit-subtree-stat",
     }
 }
 
+DEFUN ("treesit--parser-view",
+       Ftreesit__parser_view,
+       Streesit__parser_view, 1, 1, 0,
+       doc: /* Return the view of PARSER.
+Read buffer like PARSER would into a string and return it.  */)
+  (Lisp_Object parser)
+{
+  const ptrdiff_t visible_beg = XTS_PARSER (parser)->visible_beg;
+  const ptrdiff_t visible_end = XTS_PARSER (parser)->visible_end;
+  const ptrdiff_t view_len = visible_end - visible_beg;
+
+  char *str_buf = xzalloc (view_len + 1);
+  uint32_t read = 0;
+  TSPoint pos = { 0 };
+  for (int idx = 0; idx < view_len; idx++)
+    {
+      const char *ch = treesit_read_buffer (XTS_PARSER (parser),
+					    idx, pos, &read);
+      if (read == 0)
+	{
+	  xfree (str_buf);
+	  xsignal1 (Qtreesit_error, make_fixnum (idx));
+	}
+      else
+	str_buf[idx] = *ch;
+    }
+  Lisp_Object ret_str = make_string (str_buf, view_len);
+  xfree (str_buf);
+  return ret_str;
+}
+
 #endif	/* HAVE_TREE_SITTER */
 
 DEFUN ("treesit-available-p", Ftreesit_available_p,
@@ -3633,6 +3671,8 @@ syms_of_treesit (void)
   defsubr (&Streesit_search_forward);
   defsubr (&Streesit_induce_sparse_tree);
   defsubr (&Streesit_subtree_stat);
+
+  defsubr (&Streesit__parser_view);
 #endif /* HAVE_TREE_SITTER */
   defsubr (&Streesit_available_p);
 }

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* bug#61369: Problem with keeping tree-sitter parse tree up-to-date
  2023-02-13 23:59 ` Yuan Fu
@ 2023-02-15  2:17   ` Dmitry Gutov
  2023-02-15 22:44     ` Dmitry Gutov
  0 siblings, 1 reply; 16+ messages in thread
From: Dmitry Gutov @ 2023-02-15  2:17 UTC (permalink / raw)
  To: Yuan Fu; +Cc: theo, 61369

On 14/02/2023 01:59, Yuan Fu wrote:
> There are two surprises here: 1) there isn’t an off-by-one bug, 2) the
> parser actually read the whole buffer, rather than reading only the new
> content. Then there are even less reason for it to create that error
> node.

The parser reads the whole buffer, but if it tries to reparse based on 
the previous parse tree with incorrect positions, it might get into an 
invalid state as a result.

I've tried gdb-ing treesit_tree_edit_1 (after dropping the 'inline' 
qualifier), and here's what I see:

- If I paste the test line without the trailing newline or not, the value.

- If I paste the test line with the trailing newline, the value of 
new_end_byte is still 67. But then it is followed by this call right away:

Thread 1 "emacs" hit Breakpoint 3, treesit_tree_edit_1 
(tree=tree@entry=0x5555574139b0, start_byte=start_byte@entry=134, 
old_end_byte=old_end_byte@entry=134, new_end_byte=135) at treesit.c:739

- If I 'undo' after that, the call is as expected:

Thread 1 "emacs" hit Breakpoint 3, treesit_tree_edit_1 
(tree=0x555557435cd0, start_byte=start_byte@entry=0, 
old_end_byte=old_end_byte@entry=68, new_end_byte=new_end_byte@entry=0) 
at treesit.c:739
739	{

So I tried again to figure out the odd call, with the backtrace:

Thread 1 "emacs" hit Breakpoint 3, treesit_tree_edit_1 
(tree=tree@entry=0x5555575b64f0, start_byte=start_byte@entry=134, 
old_end_byte=old_end_byte@entry=134, new_end_byte=269) at treesit.c:739
739	{
(gdb) backtrace
#0  treesit_tree_edit_1 (tree=tree@entry=0x5555575b64f0, 
start_byte=start_byte@entry=134, old_end_byte=old_end_byte@entry=134, 
new_end_byte=269) at treesit.c:739
#1  0x00005555557cb085 in treesit_sync_visible_region 
(parser=parser@entry=XIL(0x555556fc329d)) at treesit.c:931
#2  0x00005555557ccf28 in treesit_ensure_parsed 
(parser=XIL(0x555556fc329d)) at treesit.c:1025
#3  Ftreesit_parser_root_node (parser=XIL(0x555556fc329d)) at treesit.c:1507

treesit.c:739 points to a treesit_tree_edit_1 call which is predicated 
on this condition:

   if (visible_end < BUF_ZV_BYTE (buffer))

...which shouldn't be the case since the buffer is small enough to fit 
in the default window. It might already be the consequence of passing 
the wrong value of new_end_byte to ts_tree_edit, though.

Going back to the first call, the backtrace looks like this:

Thread 1 "emacs" hit Breakpoint 3, treesit_tree_edit_1 
(tree=0x5555574f0ff0, start_byte=start_byte@entry=0, 
old_end_byte=old_end_byte@entry=0, new_end_byte=new_end_byte@entry=67) 
at treesit.c:739
739	{
(gdb) backtrace
#0  treesit_tree_edit_1 (tree=0x5555574f0ff0, 
start_byte=start_byte@entry=0, old_end_byte=old_end_byte@entry=0, 
new_end_byte=new_end_byte@entry=67) at treesit.c:739
#1  0x00005555557cc991 in treesit_record_change (start_byte=1, 
old_end_byte=1, new_end_byte=69) at treesit.c:806
#2  0x00005555556f8bb7 in insert_from_string_1 
(string=XIL(0x55555744c4f4), pos=0, pos_byte=0, nchars=68, nbytes=68, 
inherit=<optimized out>, before_markers=false) at insdel.c:1084

Seems like treesit_record_change turns new_end_byte=69 into 
new_end_byte=67 inside treesit_tree_edit_1.

It seems to fail in this calculation:

   ptrdiff_t new_end_offset = (min (visible_end,
				   max (visible_end, new_end_byte))
			      - visible_beg);

because visible_end is still 68 there. It value gets updated later, 
closer to the end of this function.





^ permalink raw reply	[flat|nested] 16+ messages in thread

* bug#61369: Problem with keeping tree-sitter parse tree up-to-date
  2023-02-15  2:17   ` Dmitry Gutov
@ 2023-02-15 22:44     ` Dmitry Gutov
  2023-02-17 22:32       ` Yuan Fu
  0 siblings, 1 reply; 16+ messages in thread
From: Dmitry Gutov @ 2023-02-15 22:44 UTC (permalink / raw)
  To: Yuan Fu; +Cc: theo, 61369

On 15/02/2023 04:17, Dmitry Gutov wrote:
> Seems like treesit_record_change turns new_end_byte=69 into 
> new_end_byte=67 inside treesit_tree_edit_1.
> 
> It seems to fail in this calculation:
> 
>    ptrdiff_t new_end_offset = (min (visible_end,
>                     max (visible_end, new_end_byte))
>                    - visible_beg);
> 
> because visible_end is still 68 there. It value gets updated later, 
> closer to the end of this function.

So FWIW the patch below fixes the problem. But I'm not sure about 
change's clipping by the current restriction, or how it should be 
handled exactly.

diff --git a/src/treesit.c b/src/treesit.c
index cab2f0d5354..9f15b88a8bd 100644
--- a/src/treesit.c
+++ b/src/treesit.c
@@ -794,9 +794,7 @@ treesit_record_change (ptrdiff_t start_byte, 
ptrdiff_t old_end_byte,
  	  ptrdiff_t old_end_offset = (min (visible_end,
  					   max (visible_beg, old_end_byte))
  				      - visible_beg);
-	  ptrdiff_t new_end_offset = (min (visible_end,
-					   max (visible_beg, new_end_byte))
-				      - visible_beg);
+	  ptrdiff_t new_end_offset = max (visible_beg, new_end_byte) - 
visible_beg;
  	  eassert (start_offset <= old_end_offset);
  	  eassert (start_offset <= new_end_offset);







^ permalink raw reply related	[flat|nested] 16+ messages in thread

* bug#61369: Problem with keeping tree-sitter parse tree up-to-date
  2023-02-15 22:44     ` Dmitry Gutov
@ 2023-02-17 22:32       ` Yuan Fu
  2023-02-18  0:11         ` Dmitry Gutov
  0 siblings, 1 reply; 16+ messages in thread
From: Yuan Fu @ 2023-02-17 22:32 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: Theodor Thornhill, 61369



> On Feb 15, 2023, at 2:44 PM, Dmitry Gutov <dgutov@yandex.ru> wrote:
> 
> On 15/02/2023 04:17, Dmitry Gutov wrote:
>> Seems like treesit_record_change turns new_end_byte=69 into new_end_byte=67 inside treesit_tree_edit_1.
>> It seems to fail in this calculation:
>>   ptrdiff_t new_end_offset = (min (visible_end,
>>                    max (visible_end, new_end_byte))
>>                   - visible_beg);
>> because visible_end is still 68 there. It value gets updated later, closer to the end of this function.
> 
> So FWIW the patch below fixes the problem. But I'm not sure about change's clipping by the current restriction, or how it should be handled exactly.
> 
> diff --git a/src/treesit.c b/src/treesit.c
> index cab2f0d5354..9f15b88a8bd 100644
> --- a/src/treesit.c
> +++ b/src/treesit.c
> @@ -794,9 +794,7 @@ treesit_record_change (ptrdiff_t start_byte, ptrdiff_t old_end_byte,
>  ptrdiff_t old_end_offset = (min (visible_end,
>   max (visible_beg, old_end_byte))
>      - visible_beg);
> -  ptrdiff_t new_end_offset = (min (visible_end,
> -   max (visible_beg, new_end_byte))
> -      - visible_beg);
> +  ptrdiff_t new_end_offset = max (visible_beg, new_end_byte) - visible_beg;
>  eassert (start_offset <= old_end_offset);
>  eassert (start_offset <= new_end_offset);

Thank you very much! I thought that clipping the change into the fixed visible range, and rely on treesit_sync_visible_region to add back the clipped “tail” when we extend the visible range would be equivalent to not clipping, but I guess clipping and re-adding affects how incremental parsing works inside tree-sitter.

I don’t think this change would have any adverse effect, because if you think of it, inserting text in a narrowed region always extends the region, rather than pushing text at the end out of the narrowed region. So the right thing to do here is in fact not clipping new_end_offset.

I pushed this change.

Yuan




^ permalink raw reply	[flat|nested] 16+ messages in thread

* bug#61369: Problem with keeping tree-sitter parse tree up-to-date
  2023-02-17 22:32       ` Yuan Fu
@ 2023-02-18  0:11         ` Dmitry Gutov
  2023-02-18  1:14           ` Yuan Fu
  2023-02-18  7:15           ` Eli Zaretskii
  0 siblings, 2 replies; 16+ messages in thread
From: Dmitry Gutov @ 2023-02-18  0:11 UTC (permalink / raw)
  To: Yuan Fu; +Cc: Theodor Thornhill, 61369

On 18/02/2023 00:32, Yuan Fu wrote:
> Thank you very much! I thought that clipping the change into the fixed visible range, and rely on treesit_sync_visible_region to add back the clipped “tail” when we extend the visible range would be equivalent to not clipping, but I guess clipping and re-adding affects how incremental parsing works inside tree-sitter.

It seems like the "repairing" sync used a different range, one that 
didn't include the character number 68 inserted from the beginning.

It just synced the 1 or 2 characters at the end of the buffer, the 
difference between the computed visible_end and the actual BUF_ZV_BYTE.

> I don’t think this change would have any adverse effect, because if you think of it, inserting text in a narrowed region always extends the region, rather than pushing text at the end out of the narrowed region. So the right thing to do here is in fact not clipping new_end_offset.

I figured it could be a problem if both old_end_byte and new_end_byte 
extend past the current restriction.

But I'm not sure whether that could actually happen in practice. The 
obvious attempts (undo a change outside of the narrowing, or revert the 
buffer when narrowing is in effect) didn't play out, but I'm not sure 
whether there is an actual hard limit on modifying the text outside of 
the current restriction.

> I pushed this change.

Thanks. Good to see it make it in.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* bug#61369: Problem with keeping tree-sitter parse tree up-to-date
  2023-02-18  0:11         ` Dmitry Gutov
@ 2023-02-18  1:14           ` Yuan Fu
  2023-02-18  1:25             ` Dmitry Gutov
  2023-02-18  7:15           ` Eli Zaretskii
  1 sibling, 1 reply; 16+ messages in thread
From: Yuan Fu @ 2023-02-18  1:14 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: Theodor Thornhill, 61369



> On Feb 17, 2023, at 4:11 PM, Dmitry Gutov <dgutov@yandex.ru> wrote:
> 
> On 18/02/2023 00:32, Yuan Fu wrote:
>> Thank you very much! I thought that clipping the change into the fixed visible range, and rely on treesit_sync_visible_region to add back the clipped “tail” when we extend the visible range would be equivalent to not clipping, but I guess clipping and re-adding affects how incremental parsing works inside tree-sitter.
> 
> It seems like the "repairing" sync used a different range, one that didn't include the character number 68 inserted from the beginning.
> 
> It just synced the 1 or 2 characters at the end of the buffer, the difference between the computed visible_end and the actual BUF_ZV_BYTE.

That should be enough, no? Because other text didn’t change, they just moved. And tree-sitter should know that they moved. Or maybe I’m misunderstanding what you mean.

> 
>> I don’t think this change would have any adverse effect, because if you think of it, inserting text in a narrowed region always extends the region, rather than pushing text at the end out of the narrowed region. So the right thing to do here is in fact not clipping new_end_offset.
> 
> I figured it could be a problem if both old_end_byte and new_end_byte extend past the current restriction.

That should be fine (ie, technically correct), since when we widen, the clipped text are reparsed by tree-sitter as new text.

> 
> But I'm not sure whether that could actually happen in practice. The obvious attempts (undo a change outside of the narrowing, or revert the buffer when narrowing is in effect) didn't play out, but I'm not sure whether there is an actual hard limit on modifying the text outside of the current restriction.

It is my impression that Emacs in general enforces the narrowing restriction strictly. And we are still correct when exceptions occur.

Yuan




^ permalink raw reply	[flat|nested] 16+ messages in thread

* bug#61369: Problem with keeping tree-sitter parse tree up-to-date
  2023-02-18  1:14           ` Yuan Fu
@ 2023-02-18  1:25             ` Dmitry Gutov
  2023-02-18 10:05               ` Yuan Fu
  0 siblings, 1 reply; 16+ messages in thread
From: Dmitry Gutov @ 2023-02-18  1:25 UTC (permalink / raw)
  To: Yuan Fu; +Cc: Theodor Thornhill, 61369

On 18/02/2023 03:14, Yuan Fu wrote:
> 
> 
>> On Feb 17, 2023, at 4:11 PM, Dmitry Gutov <dgutov@yandex.ru> wrote:
>>
>> On 18/02/2023 00:32, Yuan Fu wrote:
>>> Thank you very much! I thought that clipping the change into the fixed visible range, and rely on treesit_sync_visible_region to add back the clipped “tail” when we extend the visible range would be equivalent to not clipping, but I guess clipping and re-adding affects how incremental parsing works inside tree-sitter.
>>
>> It seems like the "repairing" sync used a different range, one that didn't include the character number 68 inserted from the beginning.
>>
>> It just synced the 1 or 2 characters at the end of the buffer, the difference between the computed visible_end and the actual BUF_ZV_BYTE.
> 
> That should be enough, no? Because other text didn’t change, they just moved. And tree-sitter should know that they moved. Or maybe I’m misunderstanding what you mean.

But the "unsynced" character is at position 68.

And we just tell tree-sitter to update positions 134-136. So it stays 
ignorant of the changed char in the middle of the buffer.

It's not just about not knowing about the change either (the character 
in question is a newline, so its absence wouldn't lead to a syntax 
error), but about wrong offsets in the old parse tree, based on which 
the new tree is generated. That probably creates a wrong picture of the 
source text in the parser.

>>> I don’t think this change would have any adverse effect, because if you think of it, inserting text in a narrowed region always extends the region, rather than pushing text at the end out of the narrowed region. So the right thing to do here is in fact not clipping new_end_offset.
>>
>> I figured it could be a problem if both old_end_byte and new_end_byte extend past the current restriction.
> 
> That should be fine (ie, technically correct), since when we widen, the clipped text are reparsed by tree-sitter as new text.

I guess the effect I was thinking of is that

   XTS_PARSER (lisp_parser)->visible_end

would end up with a higher value than BUF_ZV_BYTE. Not sure if it's a 
problem.

>>
>> But I'm not sure whether that could actually happen in practice. The obvious attempts (undo a change outside of the narrowing, or revert the buffer when narrowing is in effect) didn't play out, but I'm not sure whether there is an actual hard limit on modifying the text outside of the current restriction.
> 
> It is my impression that Emacs in general enforces the narrowing restriction strictly. And we are still correct when exceptions occur.

Very good.






^ permalink raw reply	[flat|nested] 16+ messages in thread

* bug#61369: Problem with keeping tree-sitter parse tree up-to-date
  2023-02-18  1:25             ` Dmitry Gutov
@ 2023-02-18 10:05               ` Yuan Fu
  0 siblings, 0 replies; 16+ messages in thread
From: Yuan Fu @ 2023-02-18 10:05 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: Theodor Thornhill, 61369



> On Feb 17, 2023, at 5:25 PM, Dmitry Gutov <dgutov@yandex.ru> wrote:
> 
> On 18/02/2023 03:14, Yuan Fu wrote:
>>> On Feb 17, 2023, at 4:11 PM, Dmitry Gutov <dgutov@yandex.ru> wrote:
>>> 
>>> On 18/02/2023 00:32, Yuan Fu wrote:
>>>> Thank you very much! I thought that clipping the change into the fixed visible range, and rely on treesit_sync_visible_region to add back the clipped “tail” when we extend the visible range would be equivalent to not clipping, but I guess clipping and re-adding affects how incremental parsing works inside tree-sitter.
>>> 
>>> It seems like the "repairing" sync used a different range, one that didn't include the character number 68 inserted from the beginning.
>>> 
>>> It just synced the 1 or 2 characters at the end of the buffer, the difference between the computed visible_end and the actual BUF_ZV_BYTE.
>> That should be enough, no? Because other text didn’t change, they just moved. And tree-sitter should know that they moved. Or maybe I’m misunderstanding what you mean.
> 
> But the "unsynced" character is at position 68.
> 
> And we just tell tree-sitter to update positions 134-136. So it stays ignorant of the changed char in the middle of the buffer.
> 
> It's not just about not knowing about the change either (the character in question is a newline, so its absence wouldn't lead to a syntax error), but about wrong offsets in the old parse tree, based on which the new tree is generated. That probably creates a wrong picture of the source text in the parser.

Ok, I made some visualization to understand it, and yeah you are right. I’ll need to modify the comment a bit.

|visible range|

updated range
-------------

|aaaaaa|
|bbbbbbbbaaaa|aa  start: 0, old_end: 0, new_end: 6
 ------          
|bbbbbbbbaaaaaa|  start: 12, old_end: 12, new_end: 14
             --


> 
>>>> I don’t think this change would have any adverse effect, because if you think of it, inserting text in a narrowed region always extends the region, rather than pushing text at the end out of the narrowed region. So the right thing to do here is in fact not clipping new_end_offset.
>>> 
>>> I figured it could be a problem if both old_end_byte and new_end_byte extend past the current restriction.
>> That should be fine (ie, technically correct), since when we widen, the clipped text are reparsed by tree-sitter as new text.
> 
> I guess the effect I was thinking of is that
> 
>  XTS_PARSER (lisp_parser)->visible_end
> 
> would end up with a higher value than BUF_ZV_BYTE. Not sure if it's a problem.

It shouldn’t be, since BUF_ZV_BYTE should automatically grow when user inserts text. Even if it does, we always call treesit_sync_visible_region to sync up visible_beg/end with BUF_(Z)V_BYTE before parsing, so it shouldn’t be a problem.

Yuan




^ permalink raw reply	[flat|nested] 16+ messages in thread

* bug#61369: Problem with keeping tree-sitter parse tree up-to-date
  2023-02-18  0:11         ` Dmitry Gutov
  2023-02-18  1:14           ` Yuan Fu
@ 2023-02-18  7:15           ` Eli Zaretskii
  2023-02-18 17:21             ` Dmitry Gutov
  1 sibling, 1 reply; 16+ messages in thread
From: Eli Zaretskii @ 2023-02-18  7:15 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: casouri, theo, 61369

> Cc: Theodor Thornhill <theo@thornhill.no>, 61369@debbugs.gnu.org
> Date: Sat, 18 Feb 2023 02:11:22 +0200
> From: Dmitry Gutov <dgutov@yandex.ru>
> 
> I'm not sure whether there is an actual hard limit on modifying the
> text outside of the current restriction.

It is, for most/all practical purposes.  If you try to modify text
outside of the current restriction, you risk many parts of code
barfing or signaling errors on you.  For example, conversion from
character to byte positions and vice versa will stop working (and in a
build with --enable-checking will actually raise SIGABRT).

^ permalink raw reply	[flat|nested] 16+ messages in thread

* bug#61369: Problem with keeping tree-sitter parse tree up-to-date
  2023-02-18  7:15           ` Eli Zaretskii
@ 2023-02-18 17:21             ` Dmitry Gutov
  0 siblings, 0 replies; 16+ messages in thread
From: Dmitry Gutov @ 2023-02-18 17:21 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: casouri, theo, 61369

On 18/02/2023 09:15, Eli Zaretskii wrote:
>> Cc: Theodor Thornhill<theo@thornhill.no>,61369@debbugs.gnu.org
>> Date: Sat, 18 Feb 2023 02:11:22 +0200
>> From: Dmitry Gutov<dgutov@yandex.ru>
>>
>> I'm not sure whether there is an actual hard limit on modifying the
>> text outside of the current restriction.
> It is, for most/all practical purposes.  If you try to modify text
> outside of the current restriction, you risk many parts of code
> barfing or signaling errors on you.  For example, conversion from
> character to byte positions and vice versa will stop working (and in a
> build with --enable-checking will actually raise SIGABRT).

All right, thank you both.





^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2023-02-18 17:21 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-02-08 15:34 bug#61369: Problem with keeping tree-sitter parse tree up-to-date Dmitry Gutov
2023-02-08 18:20 ` Theodor Thornhill via Bug reports for GNU Emacs, the Swiss army knife of text editors
2023-02-08 19:41   ` Dmitry Gutov
2023-02-10  1:22 ` Yuan Fu
2023-02-10  1:38   ` Dmitry Gutov
2023-02-13  9:10 ` Yuan Fu
2023-02-13 23:59 ` Yuan Fu
2023-02-15  2:17   ` Dmitry Gutov
2023-02-15 22:44     ` Dmitry Gutov
2023-02-17 22:32       ` Yuan Fu
2023-02-18  0:11         ` Dmitry Gutov
2023-02-18  1:14           ` Yuan Fu
2023-02-18  1:25             ` Dmitry Gutov
2023-02-18 10:05               ` Yuan Fu
2023-02-18  7:15           ` Eli Zaretskii
2023-02-18 17:21             ` Dmitry Gutov

Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/emacs.git
	https://git.savannah.gnu.org/cgit/emacs/org-mode.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.