From: Yuan Fu <casouri@gmail.com>
To: Eli Zaretskii <eliz@gnu.org>
Cc: "Stefan Kangas" <stefankangas@gmail.com>,
	"Po Lu" <luangruo@yahoo.com>, "Emacs Devel" <emacs-devel@gnu.org>,
	"Mattias Engdegård" <mattias.engdegard@gmail.com>
Subject: Re: master d995429e7bc: Use SBYTES instead of strlen in treesit.c
Date: Tue, 23 Jul 2024 10:09:33 -0700
Message-ID: <6F576962-25BD-4DF1-8827-7C2C4C8C77F3@gmail.com>
In-Reply-To: <8634o1br4c.fsf@gnu.org>



> On Jul 22, 2024, at 4:30 AM, Eli Zaretskii <eliz@gnu.org> wrote:
> 
>> From: Stefan Kangas <stefankangas@gmail.com>
>> Date: Mon, 22 Jul 2024 04:06:30 -0700
>> 
>> Po Lu <luangruo@yahoo.com> writes:
>> 
>>> Have you verified that these functions accept strings holding '\0'?
>> 
>> AFAIK, SBYTES returns the string length excluding '\0', same as strlen.
> 
> That's not the issue here.  The issue is that Emacs Lisp strings can
> include embedded null bytes, which strlen will exclude, but SBYTES
> will not.
> 
> There's perhaps a more general issue here: since tree-sitter accepts
> UTF-8 encoded strings, we should encode the Lisp strings before we
> pass them to tree-sitter.
> 
> Yuan, can you please look into this?
> 
> Btw, where does the tree-sitter docs say that all strings are supposed
> to be in UTF-8 and that their length is supposed to be passed as
> byte-counts, not character-counts?

It doesn’t say so, but since it’s a C API, I think it’s natural to assume that the length we pass along with the string is a byte count. Also, we pass two kinds of strings to tree-sitter. One is the source code, which I know for sure must be UTF-8 or UTF-16, with lengths counted in bytes. The other is the query string, which I think is ASCII, but nowhere do the tree-sitter docs say so explicitly. Mattias might know more about it.
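
To make the byte-count point concrete, here is a minimal C sketch. It is not code from treesit.c, and the helper names `has_embedded_nul` and `utf8_char_count` are made up for illustration. It shows the two ways a length can go wrong: strlen stops at the first null byte, and a character count differs from a byte count once the text contains multibyte UTF-8 sequences. It assumes the input is valid UTF-8.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if the first NBYTES bytes of DATA contain
   a null byte.  In that case strlen (DATA) would report a length
   shorter than NBYTES.  */
static bool
has_embedded_nul (const char *data, size_t nbytes)
{
  return memchr (data, '\0', nbytes) != NULL;
}

/* Hypothetical helper: count the characters in NBYTES bytes of
   UTF-8 text by skipping continuation bytes (bit pattern 10xxxxxx).
   Assumes DATA is valid UTF-8.  */
static size_t
utf8_char_count (const char *data, size_t nbytes)
{
  size_t chars = 0;
  for (size_t i = 0; i < nbytes; i++)
    if (((unsigned char) data[i] & 0xC0) != 0x80)
      chars++;
  return chars;
}
```

For the 3-byte string "a\0b", strlen reports 1 while has_embedded_nul returns true. For "café" encoded in UTF-8 (5 bytes), utf8_char_count returns 4, so passing a character count where a byte count is expected would cut off the last byte.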

For source code, tree-sitter says (note `bytes_read`, and `TSInputEncodingUTF8` or `TSInputEncodingUTF16`):

 * The [`TSInput`] parameter lets you specify how to read the text. It has the
 * following three fields:
 * 1. [`read`]: A function to retrieve a chunk of text at a given byte offset
 *    and (row, column) position. The function should return a pointer to the
 *    text and write its length to the [`bytes_read`] pointer. The parser does
 *    not take ownership of this buffer; it just borrows it until it has
 *    finished reading it. The function should write a zero value to the
 *    [`bytes_read`] pointer to indicate the end of the document.
 * 2. [`payload`]: An arbitrary pointer that will be passed to each invocation
 *    of the [`read`] function.
 * 3. [`encoding`]: An indication of how the text is encoded. Either
 *    `TSInputEncodingUTF8` or `TSInputEncodingUTF16`.
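
As a rough sketch of what that contract implies, a read callback over one contiguous UTF-8 buffer could look like the following. To keep it compilable without tree-sitter's header, it uses a stand-in point type instead of the real `TSPoint`, and the `sketch_source` payload and `sketch_read` names are hypothetical. It is not how treesit.c reads buffer text. The point is only that both the offset coming in and the length going out are byte counts, so null bytes in the text need no special handling.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for tree-sitter's point type (row and column), defined
   here only so the sketch compiles on its own.  */
typedef struct
{
  uint32_t row;
  uint32_t column;
} sketch_point;

/* Hypothetical payload: a contiguous UTF-8 buffer and its size in
   bytes (not characters).  */
struct sketch_source
{
  const char *text;
  uint32_t nbytes;
};

/* Return a pointer to the text starting at BYTE_INDEX and store the
   number of bytes available there in *BYTES_READ.  Storing zero
   signals the end of the document.  */
static const char *
sketch_read (void *payload, uint32_t byte_index,
             sketch_point position, uint32_t *bytes_read)
{
  (void) position;  /* Unused: this sketch addresses text by byte offset only.  */
  const struct sketch_source *src = payload;

  if (byte_index >= src->nbytes)
    {
      *bytes_read = 0;
      return "";
    }

  *bytes_read = src->nbytes - byte_index;
  return src->text + byte_index;
}
```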


For the query string, tree-sitter only says:

/**
 * Create a new query from a string containing one or more S-expression
 * patterns. The query is associated with a particular language, and can
 * only be run on syntax nodes parsed with that language.
 *
 * If all of the given patterns are valid, this returns a [`TSQuery`].
 * If a pattern is invalid, this returns `NULL`, and provides two pieces
 * of information about the problem:
 * 1. The byte offset of the error is written to the `error_offset` parameter.
 * 2. The type of error is written to the `error_type` parameter.
 */
TSQuery *ts_query_new(
  const TSLanguage *language,
  const char *source,
  uint32_t source_len,
  uint32_t *error_offset,
  TSQueryError *error_type
);


Yuan

