From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yuan Fu
Newsgroups: gmane.emacs.devel
Subject: Re: master d995429e7bc: Use SBYTES instead of strlen in treesit.c
Date: Tue, 23 Jul 2024 10:09:33 -0700
Message-ID: <6F576962-25BD-4DF1-8827-7C2C4C8C77F3@gmail.com>
References: <172164369582.30827.14373383262408294645@vcs2.savannah.gnu.org>
 <20240722102136.6C9D6C3534A@vcs2.savannah.gnu.org>
 <87o76pyb5h.fsf@yahoo.com>
 <8634o1br4c.fsf@gnu.org>
In-Reply-To: <8634o1br4c.fsf@gnu.org>
Content-Type: text/plain; charset=utf-8
To: Eli Zaretskii
Cc: Stefan Kangas, Po Lu, Emacs Devel, Mattias Engdegård

> On Jul 22, 2024, at 4:30 AM, Eli Zaretskii wrote:
>
>> From: Stefan Kangas
>> Date: Mon, 22 Jul 2024 04:06:30 -0700
>>
>> Po Lu writes:
>>
>>> Have you verified that these functions accept strings holding '\0'?
>>
>> AFAIK, SBYTES returns the string length excluding '\0', same as strlen.
>
> That's not the issue here.  The issue is that Emacs Lisp strings can
> include embedded null bytes, which strlen will exclude, but SBYTES
> will not.
>
> There's perhaps a more general issue here: since tree-sitter accepts
> UTF-8 encoded strings, we should encode the Lisp strings before we
> pass them to tree-sitter.
>
> Yuan, can you please look into this?
>
> Btw, where does the tree-sitter docs say that all strings are supposed
> to be in UTF-8 and that their length is supposed to be passed as
> byte-counts, not character-counts?

It doesn't say it, but since it's a C API, I think it's natural to
assume that the length we pass along with the string should be a byte
count.  Also, there are two kinds of strings we pass to tree-sitter:
one is the source code, which I know for sure must be UTF-8 or UTF-16
and is counted in bytes; the other is the query string, which I think
is ASCII, but nowhere do the tree-sitter docs explicitly say so.
Mattias might know more about it.

For source code, tree-sitter says (note "bytes_read", and
"`TSInputEncodingUTF8` or `TSInputEncodingUTF16`"):

 * The [`TSInput`] parameter lets you specify how to read the text. It has the
 * following three fields:
 * 1. [`read`]: A function to retrieve a chunk of text at a given byte offset
 *    and (row, column) position. The function should return a pointer to the
 *    text and write its length to the [`bytes_read`] pointer. The parser does
 *    not take ownership of this buffer; it just borrows it until it has
 *    finished reading it. The function should write a zero value to the
 *    [`bytes_read`] pointer to indicate the end of the document.
 * 2. [`payload`]: An arbitrary pointer that will be passed to each invocation
 *    of the [`read`] function.
 * 3. [`encoding`]: An indication of how the text is encoded. Either
 *    `TSInputEncodingUTF8` or `TSInputEncodingUTF16`.

For query string, tree-sitter only says:

/**
 * Create a new query from a string containing one or more S-expression
 * patterns. The query is associated with a particular language, and can
 * only be run on syntax nodes parsed with that language.
 *
 * If all of the given patterns are valid, this returns a [`TSQuery`].
 * If a pattern is invalid, this returns `NULL`, and provides two pieces
 * of information about the problem:
 * 1. The byte offset of the error is written to the `error_offset` parameter.
 * 2. The type of error is written to the `error_type` parameter.
 */
TSQuery *ts_query_new(
  const TSLanguage *language,
  const char *source,
  uint32_t source_len,
  uint32_t *error_offset,
  TSQueryError *error_type
);

Yuan