unofficial mirror of bug-gnu-emacs@gnu.org 
From: Yuan Fu <casouri@gmail.com>
To: Dmitry Gutov <dgutov@yandex.ru>
Cc: theo@thornhill.no, 61369@debbugs.gnu.org
Subject: bug#61369: Problem with keeping tree-sitter parse tree up-to-date
Date: Mon, 13 Feb 2023 15:59:02 -0800
Message-ID: <1AC63591-F4EF-411F-B554-7CD38B4B4888@gmail.com>
In-Reply-To: <fa9dd374-3840-c4ae-6e49-fdb5b7be08ba@yandex.ru>

[-- Attachment #1: Type: text/plain, Size: 5401 bytes --]


Yuan Fu <casouri@gmail.com> writes:

> Dmitry Gutov <dgutov@yandex.ru> writes:
>
>> On 10/02/2023 03:22, Yuan Fu wrote:
>>>>   I just want to confirm that I can reproduce this, and that if you skip
>>>>   the trailing newline from the use-statement, I don't get this behavior.
>>>>   So it seems like the newline is the crucial point, right?
>>>>
>>>> Yes, same.
>>>>
>>>> The trailing newline is necessary.
>>>>
>>>> The empty lines at the beginning of the buffer (being copied to) are necessary to reproduce this as well.
>>> Hmmm, it might be related to how tree-sitter does incremental
>>> parsing? If the newline is necessary, then I guess it’s not because
>>> Emacs missed characters when reporting edits to tree-sitter.
>>
>> The newline is somewhat necessary: the scenario doesn't work, for
>> example, if the pasted text doesn't include the newline but the buffer
>> had an additional (third) one at the top.
>>
>> But the scenario also doesn't work if some other (any) character is
>> removed from the yanked line before pasting: it could be even one
>> after the comment instruction (//).
>>
>> OTOH, if I add an extra char to the yanked line, anywhere, I can skip
>> the newline. E.g. I can paste
>>
>>   use std::path::{self, Path, PathBuf};  // good: std is a crate namee
>>
>> without a newline and still see the exact same syntax error.
>>
>> So it looks more like an off-by-one error somewhere. Maybe in our
>> code, but maybe in tree-sitter somewhere.
>
> Some progress report: I added a function that reads the buffer like a
> parser would, like this:
>
> DEFUN ("treesit--parser-view",
>        Ftreesit__parser_view,
>        Streesit__parser_view, 1, 1, 0,
>        doc: /* Return the view of PARSER.
> Read buffer like PARSER would into a string and return it.  */)
>   (Lisp_Object parser)
> {
>   const ptrdiff_t visible_beg = XTS_PARSER (parser)->visible_beg;
>   const ptrdiff_t visible_end = XTS_PARSER (parser)->visible_end;
>   const ptrdiff_t view_len = visible_end - visible_beg;
>
>   char *str_buf = xzalloc (view_len + 1);
>   uint32_t read = 0;
>   TSPoint pos = { 0 };
>   for (int idx = 0; idx < view_len; idx++)
>     {
>       const char *ch = treesit_read_buffer (XTS_PARSER (parser),
> 					    idx, pos, &read);
>       if (read == 0)
> 	{
> 	  xfree (str_buf);
> 	  xsignal1 (Qtreesit_error, make_fixnum (idx));
> 	}
>       else
> 	str_buf[idx] = *ch;
>     }
>   Lisp_Object ret_str = make_string (str_buf, view_len);
>   xfree (str_buf);
>   return ret_str;
> }
>
> After I followed the steps and got the error node, I ran this function on
> the parser, and the returned string looked good.
>
> Next I’ll try to log every character actually read by the parser and see
> if anything seems fishy.

I don’t know if it’s good news or bad news, but it doesn’t seem like an
off-by-one error. Here is what I did:

1. I applied the attached patch (patch.diff) so that treesit_read_buffer, the
function used by the tree-sitter parser to read buffer contents, prints each
position it reads and the character it gets to stdout.

2. I open test.rs which contains

"

let date = DateTime::<chrono::Utc>::from_utc(date, chrono::Utc);
"

as in the recipe. I have rust-ts-mode enabled, so Emacs prints the
characters read by the parser to stdout. I type return several times to
separate this first batch of output from the next, which is what I’m
interested in.

3. I paste

"use std::Path::{self, Path, PathBuf};  // good: std is a crate name
"

at the beginning of the buffer. Now the parse tree contains that error
node. I go to the terminal and copy the output, which looks like:

0 117
1 115
2 101
3 32
0 117
1 115
2 101
...
133 59
134 10
134 10
134 10
134 10

4. I paste this output (output.txt) into a buffer, and reconstruct the text read by
the parser with (setq str (reconstruct)), where reconstruct is:

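(defun reconstruct ()
  "Rebuild the text read by the parser from the current buffer.
Each line holds \"POS CHAR\"; positions never read are left as ?0."
  (goto-char (point-min))
  (let ((result ""))
    (while (< (point) (point-max))
      (let* ((str (buffer-substring (point) (line-end-position)))
             (nums (string-split str))
             (pos (string-to-number (car nums)))
             (char (string-to-number (cadr nums))))
        ;; Grow RESULT with ?0 filler until POS is a valid index.
        (when (not (< pos (length result)))
          (setq result (concat result
                               (make-string (- (1+ pos) (length result))
                                            ?0))))
        ;; Later reads of the same position overwrite earlier ones.
        (setf (aref result pos) char))
      (forward-line 1))
    result))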

5. I insert str into a new buffer, and (to my disappointment) the
content is identical to the buffer text.

There are two surprises here: 1) there isn’t an off-by-one bug, 2) the
parser actually read the whole buffer, rather than reading only the new
content. So there is even less reason for it to create that error
node.
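
For readers without a patched Emacs build, the logging added in step 1
can be approximated outside Emacs. The following is only a sketch: it
follows the shape of tree-sitter's read callback (payload, byte index,
point, bytes-read out-parameter), but demo_source, demo_point and
demo_read are hypothetical names, the source is a flat byte array
rather than an Emacs buffer, and handing out exactly one byte per call
is a simplifying assumption, not a claim about treesit_read_buffer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the parser's view of a buffer: a flat
   byte array plus an optional stream that receives the read log.  */
struct demo_source
{
  const char *text;
  uint32_t len;
  FILE *log;			/* NULL disables logging.  */
};

/* Same two fields as tree-sitter's TSPoint, redefined here so the
   sketch compiles without tree_sitter/api.h.  */
struct demo_point
{
  uint32_t row;
  uint32_t column;
};

/* Read callback: return a pointer to the byte at BYTE_INDEX and store
   the number of bytes available in *BYTES_READ.  Zero bytes signals
   end of input.  Each successful read is logged as "INDEX BYTE".  */
const char *
demo_read (void *payload, uint32_t byte_index,
	   struct demo_point position, uint32_t *bytes_read)
{
  struct demo_source *src = payload;
  (void) position;		/* Unused in this sketch.  */
  assert (src != NULL && bytes_read != NULL);

  if (byte_index >= src->len)
    {
      *bytes_read = 0;
      return "";
    }

  /* Assumption: hand out one byte per call.  */
  *bytes_read = 1;
  const char *beg = src->text + byte_index;
  if (src->log != NULL)
    {
      fprintf (src->log, "%u %d\n", (unsigned) byte_index,
	       (unsigned char) *beg);
      fflush (src->log);
    }
  return beg;
}
```

Setting the log field to stdout and calling demo_read for successive
indices produces lines in the same "position byte" form as output.txt.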

In addition, I inserted a new line in the Rust source buffer (test.rs),
which fixes the error node. Here is what the parser read after that
insertion:

"0000000000000000000000000000000000000000000000000000000000000000000



let 0000 = 000000000000000000000000000000000000000000000000000);"

0 means it didn’t read that position. We can see that the parser read
all the newlines, "let ", " = ", and ");". I can’t discern anything
interesting from that, though.
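
The "0 means unread" convention above comes from how the log is turned
back into text. A rough C equivalent of that reconstruction step is
sketched below, for anyone who wants to process output.txt outside
Emacs. The name reconstruct_from_log is hypothetical, the log is
assumed to arrive as one string of "POSITION BYTE" lines, and it works
on bytes rather than characters.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Rebuild the text a parser read from LOG, a string of
   "POSITION BYTE" lines.  Positions that never appear in LOG are
   filled with '0'.  Return a malloc'd, NUL-terminated string that
   the caller must free, or NULL if allocation fails.  */
char *
reconstruct_from_log (const char *log)
{
  size_t cap = 16, len = 0;
  char *result = malloc (cap);
  if (result == NULL)
    return NULL;

  const char *p = log;
  unsigned long pos;
  int byte, consumed;
  /* %n records how far sscanf got, so the next pass starts at the
     following line; leading whitespace is skipped by %lu.  */
  while (sscanf (p, "%lu %d%n", &pos, &byte, &consumed) == 2)
    {
      p += consumed;

      /* Grow the buffer so that POS and a trailing NUL both fit.  */
      if (pos + 2 > cap)
	{
	  while (cap < pos + 2)
	    cap *= 2;
	  char *grown = realloc (result, cap);
	  if (grown == NULL)
	    {
	      free (result);
	      return NULL;
	    }
	  result = grown;
	}

      /* Pad any gap up to and including POS with the '0' filler.  */
      while (len <= pos)
	result[len++] = '0';

      /* A later read of the same position overwrites the earlier.  */
      result[pos] = (char) byte;
    }

  result[len] = '\0';
  return result;
}
```

One limitation of the filler approach, in this sketch as in the Lisp
helper: a position where the parser really read the character "0" is
indistinguishable from one it never read.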

Yuan


[-- Attachment #2: output.txt --]
[-- Type: text/plain, Size: 1375 bytes --]

0 117
1 115
2 101
3 32
0 117
1 115
2 101
3 32
4 115
3 32
4 115
5 116
6 100
7 58
4 115
5 116
6 100
7 58
8 58
9 80
10 97
11 116
12 104
13 58
9 80
13 58
14 58
15 123
16 115
17 101
18 108
19 102
20 44
16 115
17 101
18 108
19 102
20 44
21 32
22 80
21 32
22 80
23 97
24 116
25 104
26 44
22 80
26 44
27 32
28 80
27 32
28 80
29 97
30 116
31 104
32 66
33 117
34 102
35 125
28 80
35 125
36 59
37 32
38 32
39 47
40 47
37 32
38 32
39 47
40 47
41 32
42 103
43 111
44 111
45 100
46 58
47 32
48 115
49 116
50 100
51 32
52 105
53 115
54 32
55 97
56 32
57 99
58 114
59 97
60 116
61 101
62 32
63 110
64 97
65 109
66 101
67 10
68 10
69 10
70 108
67 10
68 10
69 10
70 108
71 101
72 116
73 32
70 108
71 101
72 116
73 32
74 100
73 32
74 100
75 97
76 116
77 101
78 32
74 100
75 97
78 32
79 61
78 32
79 61
80 32
81 68
80 32
81 68
82 97
83 116
84 101
85 84
86 105
87 109
88 101
89 58
81 68
89 58
90 58
91 60
92 99
93 104
94 114
95 111
96 110
97 111
98 58
92 99
93 104
94 114
98 58
99 58
100 85
101 116
102 99
103 62
100 85
103 62
104 58
105 58
106 102
107 114
108 111
109 109
110 95
111 117
112 116
113 99
114 40
106 102
107 114
114 40
115 100
116 97
117 116
118 101
119 44
115 100
116 97
119 44
120 32
121 99
120 32
121 99
122 104
123 114
124 111
125 110
126 111
127 58
121 99
122 104
123 114
127 58
128 58
129 85
130 116
131 99
132 41
129 85
133 59
134 10
132 41
133 59
134 10
134 10
134 10
134 10

[-- Attachment #3: patch.diff --]
[-- Type: application/octet-stream, Size: 1767 bytes --]

diff --git a/src/treesit.c b/src/treesit.c
index cab2f0d5354..ad87a6ae759 100644
--- a/src/treesit.c
+++ b/src/treesit.c
@@ -1101,6 +1101,13 @@ treesit_read_buffer (void *parser, uint32_t byte_index,
      assertion should never hit.  */
   eassert (len < UINT32_MAX);
   *bytes_read = (uint32_t) len;
+
+  if (*bytes_read > 0)
+    {
+      printf ("%d %d\n", byte_index, *beg);
+      fflush (stdout);
+    }
+
   return beg;
 }
 
@@ -3432,6 +3439,37 @@ DEFUN ("treesit-subtree-stat",
     }
 }
 
+DEFUN ("treesit--parser-view",
+       Ftreesit__parser_view,
+       Streesit__parser_view, 1, 1, 0,
+       doc: /* Return the view of PARSER.
+Read buffer like PARSER would into a string and return it.  */)
+  (Lisp_Object parser)
+{
+  const ptrdiff_t visible_beg = XTS_PARSER (parser)->visible_beg;
+  const ptrdiff_t visible_end = XTS_PARSER (parser)->visible_end;
+  const ptrdiff_t view_len = visible_end - visible_beg;
+
+  char *str_buf = xzalloc (view_len + 1);
+  uint32_t read = 0;
+  TSPoint pos = { 0 };
+  for (int idx = 0; idx < view_len; idx++)
+    {
+      const char *ch = treesit_read_buffer (XTS_PARSER (parser),
+					    idx, pos, &read);
+      if (read == 0)
+	{
+	  xfree (str_buf);
+	  xsignal1 (Qtreesit_error, make_fixnum (idx));
+	}
+      else
+	str_buf[idx] = *ch;
+    }
+  Lisp_Object ret_str = make_string (str_buf, view_len);
+  xfree (str_buf);
+  return ret_str;
+}
+
 #endif	/* HAVE_TREE_SITTER */
 
 DEFUN ("treesit-available-p", Ftreesit_available_p,
@@ -3633,6 +3671,8 @@ syms_of_treesit (void)
   defsubr (&Streesit_search_forward);
   defsubr (&Streesit_induce_sparse_tree);
   defsubr (&Streesit_subtree_stat);
+
+  defsubr (&Streesit__parser_view);
 #endif /* HAVE_TREE_SITTER */
   defsubr (&Streesit_available_p);
 }
