unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed
* Re: master d995429e7bc: Use SBYTES instead of strlen in treesit.c
       [not found] ` <20240722102136.6C9D6C3534A@vcs2.savannah.gnu.org>
@ 2024-07-22 10:27   ` Po Lu
  2024-07-22 11:06     ` Stefan Kangas
  0 siblings, 1 reply; 10+ messages in thread
From: Po Lu @ 2024-07-22 10:27 UTC (permalink / raw)
  To: emacs-devel; +Cc: Stefan Kangas

Stefan Kangas <stefankangas@gmail.com> writes:

> branch: master
> commit d995429e7bc2bc5b5d87db45dbbaca121f8118e5
> Author: Stefan Kangas <stefankangas@gmail.com>
> Commit: Stefan Kangas <stefankangas@gmail.com>
>
>     Use SBYTES instead of strlen in treesit.c
>     
>     * src/treesit.c (treesit_ensure_query_compiled)
>     (Ftreesit_node_child_by_field_name, treesit_initialize_query):
>     Use SBYTES instead of strlen.
> ---
>  src/treesit.c | 12 ++++--------
>  1 file changed, 4 insertions(+), 8 deletions(-)
>
> diff --git a/src/treesit.c b/src/treesit.c
> index a420ef77b2d..c95a9fb9b01 100644
> --- a/src/treesit.c
> +++ b/src/treesit.c
> @@ -1310,9 +1310,7 @@ treesit_ensure_query_compiled (Lisp_Object query, Lisp_Object *signal_symbol,
>    /* Create TSQuery.  */
>    uint32_t error_offset;
>    TSQueryError error_type;
> -  char *treesit_source = SSDATA (source);
> -  treesit_query = ts_query_new (treesit_lang, treesit_source,
> -				strlen (treesit_source),
> +  treesit_query = ts_query_new (treesit_lang, SSDATA (source), SBYTES (source),
>  				&error_offset, &error_type);
>    if (treesit_query == NULL)
>      {
> @@ -2159,11 +2157,10 @@ Return nil if there is no such child.  If NODE is nil, return nil.  */)
>    CHECK_STRING (field_name);
>    treesit_initialize ();
>  
> -  char *name_str = SSDATA (field_name);
>    TSNode treesit_node = XTS_NODE (node)->node;
>    TSNode child
> -    = ts_node_child_by_field_name (treesit_node, name_str,
> -				   strlen (name_str));
> +    = ts_node_child_by_field_name (treesit_node, SSDATA (field_name),
> +				   SBYTES (field_name));

Have you verified that these functions accept strings holding '\0'?



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: master d995429e7bc: Use SBYTES instead of strlen in treesit.c
  2024-07-22 10:27   ` master d995429e7bc: Use SBYTES instead of strlen in treesit.c Po Lu
@ 2024-07-22 11:06     ` Stefan Kangas
  2024-07-22 11:30       ` Eli Zaretskii
  0 siblings, 1 reply; 10+ messages in thread
From: Stefan Kangas @ 2024-07-22 11:06 UTC (permalink / raw)
  To: Po Lu, emacs-devel

Po Lu <luangruo@yahoo.com> writes:

> Have you verified that these functions accept strings holding '\0'?

AFAIK, SBYTES returns the string length excluding '\0', same as strlen.



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: master d995429e7bc: Use SBYTES instead of strlen in treesit.c
  2024-07-22 11:06     ` Stefan Kangas
@ 2024-07-22 11:30       ` Eli Zaretskii
  2024-07-22 11:55         ` Stefan Kangas
  2024-07-23 17:09         ` Yuan Fu
  0 siblings, 2 replies; 10+ messages in thread
From: Eli Zaretskii @ 2024-07-22 11:30 UTC (permalink / raw)
  To: Stefan Kangas, Yuan Fu; +Cc: luangruo, emacs-devel

> From: Stefan Kangas <stefankangas@gmail.com>
> Date: Mon, 22 Jul 2024 04:06:30 -0700
> 
> Po Lu <luangruo@yahoo.com> writes:
> 
> > Have you verified that these functions accept strings holding '\0'?
> 
> AFAIK, SBYTES returns the string length excluding '\0', same as strlen.

That's not the issue here.  The issue is that Emacs Lisp strings can
include embedded null bytes, which strlen will exclude, but SBYTES
will not.

There's perhaps a more general issue here: since tree-sitter accepts
UTF-8 encoded strings, we should encode the Lisp strings before we
pass them to tree-sitter.

Yuan, can you please look into this?

Btw, where does the tree-sitter docs say that all strings are supposed
to be in UTF-8 and that their length is supposed to be passed as
byte-counts, not character-counts?



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: master d995429e7bc: Use SBYTES instead of strlen in treesit.c
  2024-07-22 11:30       ` Eli Zaretskii
@ 2024-07-22 11:55         ` Stefan Kangas
  2024-07-22 12:22           ` Eli Zaretskii
  2024-07-23 20:42           ` Stefan Kangas
  2024-07-23 17:09         ` Yuan Fu
  1 sibling, 2 replies; 10+ messages in thread
From: Stefan Kangas @ 2024-07-22 11:55 UTC (permalink / raw)
  To: Eli Zaretskii, Yuan Fu; +Cc: luangruo, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> > Have you verified that these functions accept strings holding '\0'?
>>
>> AFAIK, SBYTES returns the string length excluding '\0', same as strlen.
>
> That's not the issue here.  The issue is that Emacs Lisp strings can
> include embedded null bytes, which strlen will exclude, but SBYTES
> will not.

We avoid that when they are converted from a sexp (see
treesit_query_string_string).

Perhaps we should do the same when we get a string?



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: master d995429e7bc: Use SBYTES instead of strlen in treesit.c
  2024-07-22 11:55         ` Stefan Kangas
@ 2024-07-22 12:22           ` Eli Zaretskii
  2024-07-23 20:42           ` Stefan Kangas
  1 sibling, 0 replies; 10+ messages in thread
From: Eli Zaretskii @ 2024-07-22 12:22 UTC (permalink / raw)
  To: Stefan Kangas; +Cc: casouri, luangruo, emacs-devel

> From: Stefan Kangas <stefankangas@gmail.com>
> Date: Mon, 22 Jul 2024 04:55:00 -0700
> Cc: luangruo@yahoo.com, emacs-devel@gnu.org
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> >> > Have you verified that these functions accept strings holding '\0'?
> >>
> >> AFAIK, SBYTES returns the string length excluding '\0', same as strlen.
> >
> > That's not the issue here.  The issue is that Emacs Lisp strings can
> > include embedded null bytes, which strlen will exclude, but SBYTES
> > will not.
> 
> We avoid that when they are converted from a sexp (see
> treesit_query_string_string).
> 
> Perhaps we should do the same when we get a string?

If that is what tree-sitter expects, yes.  But does it?



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: master d995429e7bc: Use SBYTES instead of strlen in treesit.c
  2024-07-22 11:30       ` Eli Zaretskii
  2024-07-22 11:55         ` Stefan Kangas
@ 2024-07-23 17:09         ` Yuan Fu
  1 sibling, 0 replies; 10+ messages in thread
From: Yuan Fu @ 2024-07-23 17:09 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Stefan Kangas, Po Lu, Emacs Devel, Mattias Engdegård



> On Jul 22, 2024, at 4:30 AM, Eli Zaretskii <eliz@gnu.org> wrote:
> 
>> From: Stefan Kangas <stefankangas@gmail.com>
>> Date: Mon, 22 Jul 2024 04:06:30 -0700
>> 
>> Po Lu <luangruo@yahoo.com> writes:
>> 
>>> Have you verified that these functions accept strings holding '\0'?
>> 
>> AFAIK, SBYTES returns the string length excluding '\0', same as strlen.
> 
> That's not the issue here.  The issue is that Emacs Lisp strings can
> include embedded null bytes, which strlen will exclude, but SBYTES
> will not.
> 
> There's perhaps a more general issue here: since tree-sitter accepts
> UTF-8 encoded strings, we should encode the Lisp strings before we
> pass them to tree-sitter.
> 
> Yuan, can you please look into this?
> 
> Btw, where does the tree-sitter docs say that all strings are supposed
> to be in UTF-8 and that their length is supposed to be passed as
> byte-counts, not character-counts?

It doesn’t say it, but since it’s C API, I think it’s natural to assume that the length we pass along the string should be byte counts. Also, there are two kinds of string we pass to tree-sitter, one is the source code, which I know for sure must be utf-8 or utf-16, and counted in bytes; the other is the query string, which I think is ASCII, but no where in the tree-sitter doc explicitly says so. Mattias might know more about it.

For source code, tree-sitter says (note “bytes_read”, and "TSInputEncodingUTF8` or `TSInputEncodingUTF16"):

 * The [`TSInput`] parameter lets you specify how to read the text. It has the
 * following three fields:
 * 1. [`read`]: A function to retrieve a chunk of text at a given byte offset
 *    and (row, column) position. The function should return a pointer to the
 *    text and write its length to the [`bytes_read`] pointer. The parser does
 *    not take ownership of this buffer; it just borrows it until it has
 *    finished reading it. The function should write a zero value to the
 *    [`bytes_read`] pointer to indicate the end of the document.
 * 2. [`payload`]: An arbitrary pointer that will be passed to each invocation
 *    of the [`read`] function.
 * 3. [`encoding`]: An indication of how the text is encoded. Either
 *    `TSInputEncodingUTF8` or `TSInputEncodingUTF16`.


For query string, tree-sitter only says:

/**
 * Create a new query from a string containing one or more S-expression
 * patterns. The query is associated with a particular language, and can
 * only be run on syntax nodes parsed with that language.
 *
 * If all of the given patterns are valid, this returns a [`TSQuery`].
 * If a pattern is invalid, this returns `NULL`, and provides two pieces
 * of information about the problem:
 * 1. The byte offset of the error is written to the `error_offset` parameter.
 * 2. The type of error is written to the `error_type` parameter.
 */
TSQuery *ts_query_new(
  const TSLanguage *language,
  const char *source,
  uint32_t source_len,
  uint32_t *error_offset,
  TSQueryError *error_type
);


Yuan


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: master d995429e7bc: Use SBYTES instead of strlen in treesit.c
  2024-07-22 11:55         ` Stefan Kangas
  2024-07-22 12:22           ` Eli Zaretskii
@ 2024-07-23 20:42           ` Stefan Kangas
  2024-07-24  9:09             ` Mattias Engdegård
  1 sibling, 1 reply; 10+ messages in thread
From: Stefan Kangas @ 2024-07-23 20:42 UTC (permalink / raw)
  To: Eli Zaretskii, Yuan Fu; +Cc: luangruo, emacs-devel, Mattias Engdegård

Stefan Kangas <stefankangas@gmail.com> writes:

> Eli Zaretskii <eliz@gnu.org> writes:
>
>>> > Have you verified that these functions accept strings holding '\0'?
>>>
>>> AFAIK, SBYTES returns the string length excluding '\0', same as strlen.
>>
>> That's not the issue here.  The issue is that Emacs Lisp strings can
>> include embedded null bytes, which strlen will exclude, but SBYTES
>> will not.
>
> We avoid that when they are converted from a sexp (see
> treesit_query_string_string).
>
> Perhaps we should do the same when we get a string?

Mattias, I see you introduced treesit_query_string_string.

Do you have any thoughts about this?



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: master d995429e7bc: Use SBYTES instead of strlen in treesit.c
  2024-07-23 20:42           ` Stefan Kangas
@ 2024-07-24  9:09             ` Mattias Engdegård
  2024-07-24 11:33               ` Stefan Kangas
  0 siblings, 1 reply; 10+ messages in thread
From: Mattias Engdegård @ 2024-07-24  9:09 UTC (permalink / raw)
  To: Stefan Kangas; +Cc: Eli Zaretskii, Yuan Fu, luangruo, emacs-devel

23 juli 2024 kl. 22.42 skrev Stefan Kangas <stefankangas@gmail.com>:

>> We avoid that when they are converted from a sexp (see
>> treesit_query_string_string).
>> 
>> Perhaps we should do the same when we get a string?

Tree-sitter is something of which I know very little, but if you mean taking care of NULs when we create a Lisp string from what we get from tree-sitter, then it doesn't seem to be too broken at a quick glance:

`ts_query_capture_name_for_id` and `ts_query_string_value_for_id` return byte-counted strings and we do a somewhat reasonable job converting them.
(intern_c_string_1 is a bit questionable but so is the entire interning system; I may do something about that in Emacs 31 later on.)

Most other strings returned from TS appear to be NUL-terminated and we don't get any explicit length anyway.

For example, `ts_node_string` returns NUL-terminated string but it's a (TS) S-expression so it may contain strings in TS syntax with escaped NULs.

(We have too many string constructors in general and several of them don't do precisely what the caller thought they would but that's a different problem, to be dealt with another day.)

Also, TS string literals are almost but not quite parsable by the Lisp reader (why?) but I don't think we use the reader for them anywhere. `ts-node-string` seems to be used for testing only.




^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: master d995429e7bc: Use SBYTES instead of strlen in treesit.c
  2024-07-24  9:09             ` Mattias Engdegård
@ 2024-07-24 11:33               ` Stefan Kangas
  2024-07-24 12:02                 ` Mattias Engdegård
  0 siblings, 1 reply; 10+ messages in thread
From: Stefan Kangas @ 2024-07-24 11:33 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: Eli Zaretskii, Yuan Fu, luangruo, emacs-devel

Mattias Engdegård <mattias.engdegard@gmail.com> writes:

> 23 juli 2024 kl. 22.42 skrev Stefan Kangas <stefankangas@gmail.com>:
>
>>> We avoid that when they are converted from a sexp (see
>>> treesit_query_string_string).
>>>
>>> Perhaps we should do the same when we get a string?
>
> Tree-sitter is something of which I know very little, but if you mean
> taking care of NULs when we create a Lisp string from what we get from
> tree-sitter, then it doesn't seem to be too broken at a quick glance:

It's not about what we get from tree-sitter, but what we pass to it.

I recently changed calls to ts_node_child_by_field_name and
treesit_query from using strlen for the length argument to using SBYTES.
But that meant that we now pass a length of 7 instead of 3 for Lisp
strings like "abc\^@def".  The question is if tree-sitter will accept
that.

If it doesn't, I proposed that we might want to escape the NULs using
treesit_query_string_string.  I'm struggling to find anything clear in
the tree-sitter documentation about this, but I see your recent change.

The other options are either warning about such Lisp strings when we get
them, or (my least favorite option) just revert back to using strlen.

> (We have too many string constructors in general and several of them
> don't do precisely what the caller thought they would but that's a
> different problem, to be dealt with another day.)

Cleaning that up would be welcome in my book, FWIW.



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: master d995429e7bc: Use SBYTES instead of strlen in treesit.c
  2024-07-24 11:33               ` Stefan Kangas
@ 2024-07-24 12:02                 ` Mattias Engdegård
  0 siblings, 0 replies; 10+ messages in thread
From: Mattias Engdegård @ 2024-07-24 12:02 UTC (permalink / raw)
  To: Stefan Kangas; +Cc: Eli Zaretskii, Yuan Fu, luangruo, emacs-devel

24 juli 2024 kl. 13.33 skrev Stefan Kangas <stefankangas@gmail.com>:

> It's not about what we get from tree-sitter, but what we pass to it.
> 
> I recently changed calls to ts_node_child_by_field_name and
> treesit_query from using strlen for the length argument to using SBYTES.
> But that meant that we now pass a length of 7 instead of 3 for Lisp
> strings like "abc\^@def".  The question is if tree-sitter will accept
> that.

Yes, your change makes sense and I have no reason to object to it.
For that matter, the query is supposed to be a string in TS S-exp format, and we serialise the query passed to `treesit-query-capture` etc into that format. If the query is a string then it's supposed to be well-formed already, of course, but that doesn't seem to be a common mode of usage.

But if you ask whether an attacker could somehow use NULs in the text being parsed as a way to subvert tree-sitter and/or Emacs, then I cannot reply with any confidence. Too many moving parts and too little compartmentalisation for that.




^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2024-07-24 12:02 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <172164369582.30827.14373383262408294645@vcs2.savannah.gnu.org>
     [not found] ` <20240722102136.6C9D6C3534A@vcs2.savannah.gnu.org>
2024-07-22 10:27   ` master d995429e7bc: Use SBYTES instead of strlen in treesit.c Po Lu
2024-07-22 11:06     ` Stefan Kangas
2024-07-22 11:30       ` Eli Zaretskii
2024-07-22 11:55         ` Stefan Kangas
2024-07-22 12:22           ` Eli Zaretskii
2024-07-23 20:42           ` Stefan Kangas
2024-07-24  9:09             ` Mattias Engdegård
2024-07-24 11:33               ` Stefan Kangas
2024-07-24 12:02                 ` Mattias Engdegård
2024-07-23 17:09         ` Yuan Fu

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).