From: Yuan Fu <casouri@gmail.com>
To: Eli Zaretskii <eliz@gnu.org>
Cc: "Stefan Kangas" <stefankangas@gmail.com>,
"Po Lu" <luangruo@yahoo.com>, "Emacs Devel" <emacs-devel@gnu.org>,
"Mattias Engdegård" <mattias.engdegard@gmail.com>
Subject: Re: master d995429e7bc: Use SBYTES instead of strlen in treesit.c
Date: Tue, 23 Jul 2024 10:09:33 -0700
Message-ID: <6F576962-25BD-4DF1-8827-7C2C4C8C77F3@gmail.com>
In-Reply-To: <8634o1br4c.fsf@gnu.org>
> On Jul 22, 2024, at 4:30 AM, Eli Zaretskii <eliz@gnu.org> wrote:
>
>> From: Stefan Kangas <stefankangas@gmail.com>
>> Date: Mon, 22 Jul 2024 04:06:30 -0700
>>
>> Po Lu <luangruo@yahoo.com> writes:
>>
>>> Have you verified that these functions accept strings holding '\0'?
>>
>> AFAIK, SBYTES returns the string length excluding '\0', same as strlen.
>
> That's not the issue here. The issue is that Emacs Lisp strings can
> include embedded null bytes, which strlen will exclude, but SBYTES
> will not.
>
> There's perhaps a more general issue here: since tree-sitter accepts
> UTF-8 encoded strings, we should encode the Lisp strings before we
> pass them to tree-sitter.
>
> Yuan, can you please look into this?
>
> Btw, where does the tree-sitter docs say that all strings are supposed
> to be in UTF-8 and that their length is supposed to be passed as
> byte-counts, not character-counts?
It doesn’t say so, but since it’s a C API, I think it’s natural to assume that the length we pass along with the string should be a byte count. Also, there are two kinds of strings we pass to tree-sitter: one is the source code, which I know for sure must be UTF-8 or UTF-16 and is counted in bytes; the other is the query string, which I think is ASCII, but nowhere in the tree-sitter docs is that stated explicitly. Mattias might know more about it.
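To make the embedded-null point from earlier in the thread concrete, here is a minimal sketch of how strlen and an explicit byte count can disagree. The struct and both helpers are hypothetical stand-ins for illustration only; this is not Emacs's actual Lisp string layout, and length_by_count is only analogous to what SBYTES reports.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a Lisp string: data plus an explicit
   byte length.  NOT the real Emacs Lisp_String layout.  */
struct counted_string
{
  const char *data;
  size_t nbytes;   /* byte length, excluding the trailing NUL */
};

/* Length as strlen sees it: stops at the first NUL byte.  */
static size_t
length_by_strlen (struct counted_string s)
{
  assert (s.data != NULL);
  return strlen (s.data);
}

/* Length taken from the stored byte count (analogous to SBYTES):
   embedded NUL bytes are included.  */
static size_t
length_by_count (struct counted_string s)
{
  assert (s.data != NULL);
  return s.nbytes;
}
```

For a string holding the five bytes a, b, NUL, c, d, the first helper stops at the embedded NUL while the second reports all five bytes, so whichever length is passed to the library decides how much of the string it sees.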
For source code, tree-sitter says (note `bytes_read`, and `TSInputEncodingUTF8` or `TSInputEncodingUTF16`):
* The [`TSInput`] parameter lets you specify how to read the text. It has the
* following three fields:
* 1. [`read`]: A function to retrieve a chunk of text at a given byte offset
* and (row, column) position. The function should return a pointer to the
* text and write its length to the [`bytes_read`] pointer. The parser does
* not take ownership of this buffer; it just borrows it until it has
* finished reading it. The function should write a zero value to the
* [`bytes_read`] pointer to indicate the end of the document.
* 2. [`payload`]: An arbitrary pointer that will be passed to each invocation
* of the [`read`] function.
* 3. [`encoding`]: An indication of how the text is encoded. Either
* `TSInputEncodingUTF8` or `TSInputEncodingUTF16`.
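As an illustration of the read contract described in that comment, here is a rough sketch of a callback that serves text from a single in-memory buffer. DemoPoint, DemoBuffer and demo_read are hypothetical names, defined locally so the sketch compiles without tree_sitter/api.h; they only imitate the shape the comment describes and are not what treesit.c does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for the (row, column) position, so the sketch
   compiles without the tree-sitter headers.  */
typedef struct
{
  uint32_t row;
  uint32_t column;
} DemoPoint;

/* Hypothetical payload: a UTF-8 buffer with an explicit byte length.  */
typedef struct
{
  const char *text;
  uint32_t nbytes;
} DemoBuffer;

/* Return a pointer to the text starting at BYTE_INDEX and store the
   number of bytes available there in *BYTES_READ.  Storing zero
   signals the end of the document.  The caller only borrows the
   returned pointer; the buffer stays owned by the payload.  */
static const char *
demo_read (void *payload, uint32_t byte_index, DemoPoint position,
           uint32_t *bytes_read)
{
  (void) position;   /* not needed for a flat buffer */
  assert (payload != NULL && bytes_read != NULL);
  const DemoBuffer *buf = payload;
  if (byte_index >= buf->nbytes)
    {
      *bytes_read = 0;
      return "";
    }
  *bytes_read = buf->nbytes - byte_index;
  return buf->text + byte_index;
}
```

Everything here is counted in bytes: both the offset that comes in and the length that goes out, which matches the reading of the comment above.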
For the query string, tree-sitter only says:
/**
* Create a new query from a string containing one or more S-expression
* patterns. The query is associated with a particular language, and can
* only be run on syntax nodes parsed with that language.
*
* If all of the given patterns are valid, this returns a [`TSQuery`].
* If a pattern is invalid, this returns `NULL`, and provides two pieces
* of information about the problem:
* 1. The byte offset of the error is written to the `error_offset` parameter.
* 2. The type of error is written to the `error_type` parameter.
*/
TSQuery *ts_query_new(
  const TSLanguage *language,
  const char *source,
  uint32_t source_len,
  uint32_t *error_offset,
  TSQueryError *error_type
);
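Since error_offset is documented as a byte offset, it would differ from a character offset if a query string ever contained non-ASCII text. Below is a minimal sketch of that conversion, assuming the query is valid UTF-8. The helper name byte_offset_to_char_offset is made up for illustration and is not part of tree-sitter or Emacs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the UTF-8 characters that start within the first BYTE_OFFSET
   bytes of SOURCE.  Continuation bytes (bit pattern 10xxxxxx) are
   skipped, so only the first byte of each character is counted.
   Assumes SOURCE is valid UTF-8 and BYTE_OFFSET <= SOURCE_LEN.  */
static uint32_t
byte_offset_to_char_offset (const char *source, uint32_t source_len,
                            uint32_t byte_offset)
{
  assert (source != NULL && byte_offset <= source_len);
  uint32_t chars = 0;
  for (uint32_t i = 0; i < byte_offset; i++)
    if (((unsigned char) source[i] & 0xC0) != 0x80)
      chars++;
  return chars;
}
```

For a pure-ASCII query the two offsets are identical, so this only matters if non-ASCII text is allowed in queries, which the docs quoted above do not settle.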
Yuan