From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: master c69858b3f0: ; * lisp/treesit.el (treesit-ready-p): Guard against empty buffers.
Date: Wed, 23 Nov 2022 17:25:50 +0200
Message-ID: <8335a9zo8x.fsf@gnu.org>
References: <166916717199.12853.3816069320355351676@vcs2.savannah.gnu.org>
 <20221123013252.46814C004B6@vcs2.savannah.gnu.org>
 <83o7sxzwhn.fsf@gnu.org> <83cz9dzt8k.fsf@gnu.org>
In-Reply-To: (message from Stefan Monnier on Wed, 23 Nov 2022 09:57:38 -0500)
To: Stefan Monnier
Cc: emacs-devel@gnu.org, casouri@gmail.com

> From: Stefan Monnier
> Cc: emacs-devel@gnu.org, casouri@gmail.com
> Date: Wed, 23 Nov 2022 09:57:38 -0500
>
> >> But my question was not so much pointing out a problem but trying to
> >> understand why we chose the more complex code.
> > Because we need to compare with byte positions,
>
> Ah, because we wrote "(in bytes)" in the docstring of
> `treesit-max-buffer-size`. That's a rather unusual choice. All other
> places where we use(d) a limit on the buffer size it's always been based
> on the number of chars.

No, not because we wrote "in bytes", but because treesit.c
consistently uses byte-counts to make similar tests (with a single
exception that I fixed yesterday), and keeps track of byte positions
in its data structures.
I assumed Yuan Fu did that for a reason, and I see at least a hint in
the signature of this function, through which tree-sitter reads buffer
text:

  static const char*
  treesit_read_buffer (void *parser, uint32_t byte_index,
                       TSPoint position, uint32_t *bytes_read)

which uses byte_index and bytes_read, each of which is an unsigned
32-bit value. And since our hard limit is 4G _bytes_, it didn't seem
consistent to me to test smaller limits against character counts
rather than byte counts.

> I doubt it would make a significant difference here either (e.g. not
> only the "10 times" memory use of the tree-sitter tree is obviously
> a rough approximation, but I doubt it's related to the number of bytes
> more than to the number of chars or even the number of lexemes).

If someone looks in the tree-sitter source code and tells us that we
can compare with character counts instead, I'll be the first to agree.
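
To illustrate the distinction between byte counts and character counts
discussed above, here is a minimal standalone sketch. It is not code
from treesit.c: the function names count_utf8_chars and fits_in_uint32
are made up for illustration, and the sketch assumes the text is valid
UTF-8.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of Emacs: count the characters in the
   first NBYTES bytes of TEXT, assuming valid UTF-8.  Every byte that
   is not a continuation byte (bit pattern 10xxxxxx) starts a new
   character.  */
size_t
count_utf8_chars (const char *text, size_t nbytes)
{
  size_t nchars = 0;
  for (size_t i = 0; i < nbytes; i++)
    if (((unsigned char) text[i] & 0xC0) != 0x80)
      nchars++;
  return nchars;
}

/* Hypothetical helper, not part of Emacs: return true if the byte
   offset NBYTES can be passed through a uint32_t parameter, such as
   byte_index in the signature quoted above, without truncation.  */
bool
fits_in_uint32 (uint64_t nbytes)
{
  return nbytes <= UINT32_MAX;
}
```

For the 5-character string "h\xC3\xA9llo" (6 bytes), count_utf8_chars
returns 5, so a limit tested against the character count admits text
that is larger, in bytes, than the same limit tested against the byte
count. Only the byte count says whether an offset still fits in the
32-bit byte_index.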