From: Stephen Leake
Newsgroups: gmane.emacs.devel
Subject: Re: [SPAM UNSURE] Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
Date: Fri, 03 Apr 2020 10:11:05 -0800
Message-ID: <86zhbse8xy.fsf@stephe-leake.org>
References: <83369o1khx.fsf@gnu.org> <83imijz68s.fsf@gnu.org> <831rp7ypam.fsf@gnu.org> <86wo6yhj4d.fsf@stephe-leake.org> <83o8sax803.fsf@gnu.org> <86pncpffmk.fsf@stephe-leake.org> <83y2rdvwm1.fsf@gnu.org>
In-Reply-To: <83y2rdvwm1.fsf@gnu.org> (Eli Zaretskii's message of "Fri, 03 Apr 2020 10:47:50 +0300")
To: emacs-devel
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.2 (windows-nt)
Mime-Version: 1.0
Content-Type: text/plain
Xref: news.gmane.io gmane.emacs.devel:246348

Eli Zaretskii writes:

>> From: Stephen Leake
>> Date: Thu, 02 Apr 2020 18:49:07 -0800
>>
>> > I think we should try to avoid both copying and encoding the text we
>> > send to the parser. Both operations are expensive and require memory
>> > allocation.
>>
>> I don't understand what the alternative is. The parser imposes the
>> reasonable requirement that the input text be utf-8 (or possibly some
>> other standard format). Emacs raw buffer text is not utf-8, so we must
>> do some encoding.
>
> Emacs represents buffer text as a superset of UTF-8, with the
> violations of strict UTF-8 being very rare in buffers that hold
> program sources. The function we can provide that lets tree-sitter
> access buffer text can cope with those violations,

Ok. "cope with those violations" = "do some encoding".

We can avoid copying _if_ the encoding does not change character
positions, or somehow preserves positions, for example with an auxiliary
table of changes due to encoding.

Coping with violations in the lexer would make it much easier to avoid
changing character positions; it is easy to simply ignore bytes there.
wisi makes it easy to implement this in the lexer (because it uses
re2c), although currently there is no way to make that language-specific
(that would be an enhancement).
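
A minimal sketch of the "cope in the lexer" idea, under stated
assumptions: the lexer reads the buffer bytes in place and decodes one
unit at a time; any byte that does not start a strict UTF-8 sequence is
handed to the lexer as a single one-byte unit with U+FFFD as its code
point. Nothing is copied or rewritten, so byte positions stay those of
the buffer. The names lex_unit, lex_decode and lex_count_violations are
hypothetical; this is neither the wisi/re2c lexer nor the tree-sitter
scanner API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT 0xFFFDu

/* Hypothetical result of decoding one unit of buffer text. */
struct lex_unit {
  uint32_t codepoint; /* REPLACEMENT when the bytes are not strict UTF-8 */
  size_t nbytes;      /* bytes consumed; always >= 1, so a scan advances */
  int valid;          /* 1 if the unit was a strict UTF-8 sequence */
};

static int is_cont(unsigned char b) { return (b & 0xC0) == 0x80; }

/* Decode the unit starting at text[pos], reading the buffer in place.
   Violations of strict UTF-8 are not re-encoded: the offending byte is
   reported as a one-byte unit, so positions are never shifted. */
static struct lex_unit
lex_decode(const unsigned char *text, size_t len, size_t pos)
{
  assert(text != NULL && pos < len);
  const struct lex_unit bad = { REPLACEMENT, 1, 0 };
  unsigned char b0 = text[pos];
  size_t need;      /* number of continuation bytes expected */
  uint32_t cp, min; /* decoded value and smallest non-overlong value */

  if (b0 < 0x80)
    return (struct lex_unit){ b0, 1, 1 };
  else if ((b0 & 0xE0) == 0xC0) { need = 1; cp = b0 & 0x1F; min = 0x80; }
  else if ((b0 & 0xF0) == 0xE0) { need = 2; cp = b0 & 0x0F; min = 0x800; }
  else if ((b0 & 0xF8) == 0xF0) { need = 3; cp = b0 & 0x07; min = 0x10000; }
  else
    return bad; /* stray continuation byte, or a lead byte beyond UTF-8 */

  if (len - pos <= need)
    return bad; /* sequence is truncated by the end of the text */
  for (size_t i = 1; i <= need; i++) {
    if (!is_cont(text[pos + i]))
      return bad;
    cp = (cp << 6) | (text[pos + i] & 0x3F);
  }
  /* Reject overlong forms, surrogates and values above U+10FFFF. */
  if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return bad;
  return (struct lex_unit){ cp, need + 1, 1 };
}

/* Walk the text once and count the bytes flagged as violations. */
static size_t
lex_count_violations(const unsigned char *text, size_t len)
{
  size_t pos = 0, flagged = 0;
  while (pos < len) {
    struct lex_unit u = lex_decode(text, len, pos);
    if (!u.valid)
      flagged++;
    pos += u.nbytes;
  }
  return flagged;
}
```

For example, on the six bytes 61 C3 A9 C0 80 62 the decoder yields 'a',
U+00E9, two flagged one-byte units (the overlong pair C0 80, used here
only as a sample of non-strict UTF-8), and 'b', with every unit still
starting at its original byte offset. An auxiliary table of changes is
then unnecessary, at the cost of the lexer seeing U+FFFD where the
buffer holds something else.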

https://tree-sitter.github.io/tree-sitter/creating-parsers#external-scanners
describes the facility for enhancing the tree-sitter lexer (aka
scanner). That is not convenient for handling this issue, so we'd have
to request (and/or provide) an enhancement.

We cannot avoid encoding (either in the read function provided to
tree-sitter, or in the tree-sitter lexer), but the encoding may be very
simple and efficient.

-- 
-- Stephe