From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.io!.POSTED.ciao.gmane.io!not-for-mail From: Stephen Leake Newsgroups: gmane.emacs.devel Subject: Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) Date: Wed, 01 Apr 2020 15:38:26 -0800 Message-ID: <86wo6yhj4d.fsf@stephe-leake.org> References: <83369o1khx.fsf@gnu.org> <83imijz68s.fsf@gnu.org> <831rp7ypam.fsf@gnu.org> Mime-Version: 1.0 Content-Type: text/plain Injection-Info: ciao.gmane.io; posting-host="ciao.gmane.io:159.69.161.202"; logging-data="49693"; mail-complaints-to="usenet@ciao.gmane.io" User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.2 (windows-nt) To: emacs-devel Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Thu Apr 02 01:39:37 2020 Return-path: Envelope-to: ged-emacs-devel@m.gmane-mx.org Original-Received: from lists.gnu.org ([209.51.188.17]) by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1jJmxN-000CpB-5l for ged-emacs-devel@m.gmane-mx.org; Thu, 02 Apr 2020 01:39:37 +0200 Original-Received: from localhost ([::1]:38610 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1jJmxM-00045N-8X for ged-emacs-devel@m.gmane-mx.org; Wed, 01 Apr 2020 19:39:36 -0400 Original-Received: from eggs.gnu.org ([2001:470:142:3::10]:43664) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1jJmwO-0003KL-G3 for emacs-devel@gnu.org; Wed, 01 Apr 2020 19:38:38 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1jJmwM-0001nE-5F for emacs-devel@gnu.org; Wed, 01 Apr 2020 19:38:35 -0400 Original-Received: from gateway30.websitewelcome.com ([192.185.149.4]:13096) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1jJmwL-0001jr-1Q for emacs-devel@gnu.org; Wed, 01 Apr 2020 19:38:33 -0400 Original-Received: from cm14.websitewelcome.com (cm14.websitewelcome.com [100.42.49.7]) by gateway30.websitewelcome.com (Postfix) with ESMTP id DB73A37F27 for ; Wed, 1 Apr 2020 18:38:30 -0500 (CDT) Original-Received: from host2007.hostmonster.com ([67.20.76.71]) by cmsmtp with SMTP id JmwIj61DJXVkQJmwIjKo4I; Wed, 01 Apr 2020 18:38:30 -0500 X-Authority-Reason: nr=8 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=stephe-leake.org; s=default; h=Content-Type:MIME-Version:Message-ID: In-Reply-To:Date:References:Subject:To:From:Sender:Reply-To:Cc: Content-Transfer-Encoding:Content-ID:Content-Description:Resent-Date: Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id: List-Help:List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive; bh=ilxQ5PrJaDpjMrvyHHo+WC5dDUSFCQ3s5vGDzlU6kfM=; b=XvwwBelYL404rzAib0x4ktadm 0ojGe7PCjbz7oHXM+Xep/pmAcaDrmWhsKuWUkJRJgAtLO042Yf5f3Lpk0yvdMn3PM1cCaWNVfxxZb cInlggX1KxT9h17RWVv47K1IPFDzQ1936PhuZGKlZqj1NDLWRX6UMx7dhZ/M3lKvk3/cAsrmH1bDi 8267Ae4VgHyeWG0slqUsHsm1CEg5v+OVy4gygc3jKVGnDY8IgZ0jHRMik6aQui6JX3Gls3C1Migoc 8fSdSve+QzKm5MVYQrGoRAPkieblbSh760Xr7ullhXlNDoxMb2MpEfJgl3Jeitl1OKx2ukUVayduK n68coLRlw==; Original-Received: from [76.77.182.20] (port=62711 helo=Takver4) by host2007.hostmonster.com with esmtpsa (TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384:256) (Exim 4.92) (envelope-from ) id 1jJmwI-001i5T-8K for emacs-devel@gnu.org; Wed, 01 Apr 2020 17:38:30 -0600 In-Reply-To: <831rp7ypam.fsf@gnu.org> (Eli Zaretskii's message of "Wed, 01 Apr 2020 22:33:05 +0300") X-AntiAbuse: This header was added to track abuse, please include it with any abuse report X-AntiAbuse: Primary Hostname - host2007.hostmonster.com X-AntiAbuse: Original Domain - gnu.org X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12] X-AntiAbuse: Sender Address Domain - stephe-leake.org X-BWhitelist: no X-Source-IP: 76.77.182.20 X-Source-L: No X-Exim-ID: 1jJmwI-001i5T-8K X-Source-Sender: (Takver4) [76.77.182.20]:62711 X-Source-Auth: stephen_leake@stephe-leake.org X-Email-Count: 1 X-Source-Cap: c3RlcGhlbGU7c3RlcGhlbGU7aG9zdDIwMDcuaG9zdG1vbnN0ZXIuY29t X-Local-Domain: yes X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-Received-From: 192.185.149.4 X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Original-Sender: "Emacs-devel" Xref: news.gmane.io gmane.emacs.devel:246227 Archived-At: Eli Zaretskii writes: > Also, direct access to buffer text generally means we must make sure > GC never runs as long as pointers to buffer text are lying around. > Can any Lisp run between calls to the reader function that the > tree-sitter parser calls to access the buffer text? If the parser copies the text into an internal buffer, that reader function should only be called once per call to the parser. Parsers used to based around small buffers that would read in a file a chunk at a time, but that is not necessary on any machine that can run Emacs. Since Emacs has the entire file in memory, the parser can too. However, if we are really trying to avoid copying text (which is very premature optimization), then the reader function will be called many times during parsing (to fetch each word), and possibly during the grammar actions (to compute indent or face). > Next, I'm still asking whether parsing the whole buffer when it is > first created is necessary. To some extent, that depends on the language. The parser must be able to complete a parse, to generate a complete syntax tree. I'll assume no error correction for a moment; more below. In C or C++ body files, "a complete parse" is typically one variable or function declaration. So if Emacs can reliably find the beginning and end of those declarations, it could pass just the ones containing the region of interest to the parser. Tree-sitter (if it supports this at all, or is modified to) would end up with a forest of small parse trees, rather than one large one. They might get merged if large chunks of text are parsed together. In Ada and Java, and most C++ header files, "a complete parse" is a file; it contains an Ada package spec or body, or a Java or C++ class, or a C++ namespace. There are many "small languages" for which "a complete parse" is similar to a "statement". Bash shell, for example. They could pass just the statement, but only if Emacs can reliably find the start and end (not always easy). It is also possible to modify the language grammar to allow smaller pieces of code to be a complete parse; ada-mode does this, making a single declaration or statement "a complete parse", in order to support "partial parse". That can easily lead to errors in indent, since the indent of the start of the text portion is unknown (ada-mode simply assumes it is correct in the buffer). Another reason to allow smaller code chunks to be a complete parse is to allow parsing the code fragments that appear in grammar actions; the ELPA package wisitoken-grammar-mode uses this for wisitoken grammar files with Ada actions. In sum, the short answer is "yes, you must parse the whole file, unless your language is particularly simple". Since we need to support the worst case, we should assume the whole file must be parsed at least once. > Can we pass to the parser just a small chunk (say, 500 bytes) of the > buffer around the window-full to be displayed next? If this presents > problems, what are those problems? In wisi, the error correction code will fill in the missing text so a complete parse is possible. Since some of that is guesses, the results may not be very good. Tree-sitter also has error correction; I'm not clear how good it is. > IOW, the issue with exposing access to buffer text to modules is IMO > secondary. yes, because copying text is fast compared to everything else going on. > My suggestion is first to figure out how to do this stuff efficiently > from within Emacs itself, as if the module interface were not part of > the equation. We can add that aspect back later. There are two times the wisi code that wraps the parser needs access to the buffer; first to copy the text, second to add text properties (faces, indent values, navigation markers). There are usually many text properties output by each parse. The positions and values of the text properties are computed by functions that run after the complete syntax tree has been produced. In wisi, those functions are added directly in the grammar source file (where they are called "post-parse grammar actions"). In tree-sitter, I assume they are called from some mode-author-written code that traverses the syntax tree (wisi provides that internally). Except I see below that the emacs tree-sitter package stores the syntax tree in the buffer. One option here is to try to standardize on an elisp representation of a syntax tree, and have both the wisi and tree-sitter parsers provide that. Then the grammar actions could be implemented in elisp. I suspect that would be very slow; elisp is just not good at traversing large complex data structures. That is not just my bias showing (I _much_ prefer doing as much as possible in Ada); I first wrote the ada-mode parser and grammar actions in elisp, and then did a complete rewrite in Ada, gaining significant speed. Although I never considered passing the syntax tree to elisp as a single object, so maybe that could work well. There is no universal standard for representing "a syntax tree". In wisi, the tree is directly produced by the LR shift and reduce operations, and thus is very close the the grammar expressed in BNF. I don't know what the tree-sitter parse tree looks like. AdaCore provides a parser similar in purpose to the wisi parsers (https://github.com/AdaCore/libadalang), that also does more of what an Ada compiler does (which could allow even better font-lock and navigation). To support those additional operations, the syntax tree is quite different from the ada-mode one. In general, each parser library, and even each grammar author, will have different representations for the syntax tree. So if we want to support different parsers, I think it is best to define the Emacs "parser API" as "give text to parser; accept text properties from parser". LSP (via eglot) provides other things the parser can return; code completion menus, for example. And for indent and face, it returns formatted text with markdown. I plan to translate that to text properties to integrate LSP into wisi. Whether LSP requires a full initial parse is up to the LSP server author (LSP itself provides both "here's the full text" and "here's partial text" messages); they have the same considerations discussed above. > And yes, doing this by consing strings is not a good idea, it will > slow things down and cause a lot of GC. It is best avoided. Thus my > questions above. I'm not sure how "convert syntax tree to elisp" compares to "consing strings". I would certainly expect it to cause a lot of GC. >> > Btw, what do you do with the tree returned by the tree-sitter parser? >> > store it in some buffer-local variable? If so, how much memory does >> > such a tree take, and when, if ever, is that memory released? >> > >> >> It's stored in a buffer-local variable. I haven't measured the memory >> they take. Memory is released when the tree object is garbage-collected >> (it's a `user-ptr'). Is it an elisp structure (or accesible from elisp)? Have you written code that traverses it to provide faces and indentation? -- -- Stephe PS; I have the beginnings of a migraine while typing this, so some of it may not make sense. Sigh.