From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.io!.POSTED.ciao.gmane.io!not-for-mail
From: Stephen Leake <stephen_leake@stephe-leake.org>
Newsgroups: gmane.emacs.devel
Subject: Re: Reliable after-change-functions (via: Using incremental parsing
 in Emacs)
Date: Wed, 01 Apr 2020 15:38:26 -0800
Message-ID: <86wo6yhj4d.fsf@stephe-leake.org>
References: <CAOGL6j4q8ouVE+JYdK5PGNHsCTrjVRHswo6mmGt6XEN7HDPFfg@mail.gmail.com>
 <83369o1khx.fsf@gnu.org>
 <CAOGL6j78K5+u3Vyybgj_HisbvLLK3BJJqVdrjU-uN7sgAKMj-g@mail.gmail.com>
 <83imijz68s.fsf@gnu.org>
 <CAOGL6j7X2PEbSV13rKM6j77trHJ-dij2ivssPqYNKniC0NgDVQ@mail.gmail.com>
 <831rp7ypam.fsf@gnu.org>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: ciao.gmane.io; posting-host="ciao.gmane.io:159.69.161.202";
	logging-data="49693"; mail-complaints-to="usenet@ciao.gmane.io"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.2 (windows-nt)
To: emacs-devel <emacs-devel@gnu.org>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Thu Apr 02 01:39:37 2020
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane-mx.org
Original-Received: from lists.gnu.org ([209.51.188.17])
	by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
	(Exim 4.92)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org>)
	id 1jJmxN-000CpB-5l
	for ged-emacs-devel@m.gmane-mx.org; Thu, 02 Apr 2020 01:39:37 +0200
Original-Received: from localhost ([::1]:38610 helo=lists1p.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.90_1)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org>)
	id 1jJmxM-00045N-8X
	for ged-emacs-devel@m.gmane-mx.org; Wed, 01 Apr 2020 19:39:36 -0400
Original-Received: from eggs.gnu.org ([2001:470:142:3::10]:43664)
 by lists.gnu.org with esmtp (Exim 4.90_1)
 (envelope-from <stephen_leake@stephe-leake.org>) id 1jJmwO-0003KL-G3
 for emacs-devel@gnu.org; Wed, 01 Apr 2020 19:38:38 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <stephen_leake@stephe-leake.org>) id 1jJmwM-0001nE-5F
 for emacs-devel@gnu.org; Wed, 01 Apr 2020 19:38:35 -0400
Original-Received: from gateway30.websitewelcome.com ([192.185.149.4]:13096)
 by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32)
 (Exim 4.71) (envelope-from <stephen_leake@stephe-leake.org>)
 id 1jJmwL-0001jr-1Q
 for emacs-devel@gnu.org; Wed, 01 Apr 2020 19:38:33 -0400
Original-Received: from cm14.websitewelcome.com (cm14.websitewelcome.com [100.42.49.7])
 by gateway30.websitewelcome.com (Postfix) with ESMTP id DB73A37F27
 for <emacs-devel@gnu.org>; Wed,  1 Apr 2020 18:38:30 -0500 (CDT)
Original-Received: from host2007.hostmonster.com ([67.20.76.71]) by cmsmtp with SMTP
 id JmwIj61DJXVkQJmwIjKo4I; Wed, 01 Apr 2020 18:38:30 -0500
X-Authority-Reason: nr=8
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
 d=stephe-leake.org; s=default; h=Content-Type:MIME-Version:Message-ID:
 In-Reply-To:Date:References:Subject:To:From:Sender:Reply-To:Cc:
 Content-Transfer-Encoding:Content-ID:Content-Description:Resent-Date:
 Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id:
 List-Help:List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive;
 bh=ilxQ5PrJaDpjMrvyHHo+WC5dDUSFCQ3s5vGDzlU6kfM=; b=XvwwBelYL404rzAib0x4ktadm
 0ojGe7PCjbz7oHXM+Xep/pmAcaDrmWhsKuWUkJRJgAtLO042Yf5f3Lpk0yvdMn3PM1cCaWNVfxxZb
 cInlggX1KxT9h17RWVv47K1IPFDzQ1936PhuZGKlZqj1NDLWRX6UMx7dhZ/M3lKvk3/cAsrmH1bDi
 8267Ae4VgHyeWG0slqUsHsm1CEg5v+OVy4gygc3jKVGnDY8IgZ0jHRMik6aQui6JX3Gls3C1Migoc
 8fSdSve+QzKm5MVYQrGoRAPkieblbSh760Xr7ullhXlNDoxMb2MpEfJgl3Jeitl1OKx2ukUVayduK
 n68coLRlw==;
Original-Received: from [76.77.182.20] (port=62711 helo=Takver4)
 by host2007.hostmonster.com with esmtpsa
 (TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384:256) (Exim 4.92)
 (envelope-from <stephen_leake@stephe-leake.org>) id 1jJmwI-001i5T-8K
 for emacs-devel@gnu.org; Wed, 01 Apr 2020 17:38:30 -0600
In-Reply-To: <831rp7ypam.fsf@gnu.org> (Eli Zaretskii's message of "Wed, 01 Apr
 2020 22:33:05 +0300")
X-AntiAbuse: This header was added to track abuse,
 please include it with any abuse report
X-AntiAbuse: Primary Hostname - host2007.hostmonster.com
X-AntiAbuse: Original Domain - gnu.org
X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12]
X-AntiAbuse: Sender Address Domain - stephe-leake.org
X-BWhitelist: no
X-Source-IP: 76.77.182.20
X-Source-L: No
X-Exim-ID: 1jJmwI-001i5T-8K
X-Source-Sender: (Takver4) [76.77.182.20]:62711
X-Source-Auth: stephen_leake@stephe-leake.org
X-Email-Count: 1
X-Source-Cap: c3RlcGhlbGU7c3RlcGhlbGU7aG9zdDIwMDcuaG9zdG1vbnN0ZXIuY29t
X-Local-Domain: yes
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic]
X-Received-From: 192.185.149.4
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
 <mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <https://lists.gnu.org/archive/html/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
 <mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org
Original-Sender: "Emacs-devel"
 <emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org>
Xref: news.gmane.io gmane.emacs.devel:246227
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/246227>

Eli Zaretskii <eliz@gnu.org> writes:

> Also, direct access to buffer text generally means we must make sure
> GC never runs as long as pointers to buffer text are lying around.
> Can any Lisp run between calls to the reader function that the
> tree-sitter parser calls to access the buffer text?  

If the parser copies the text into an internal buffer, that reader
function should only be called once per call to the parser. Parsers used
to based around small buffers that would read in a file a chunk at a
time, but that is not necessary on any machine that can run Emacs. Since
Emacs has the entire file in memory, the parser can too.

However, if we are really trying to avoid copying text (which is very
premature optimization), then the reader function will be called many
times during parsing (to fetch each word), and possibly during the
grammar actions (to compute indent or face).

> Next, I'm still asking whether parsing the whole buffer when it is
> first created is necessary.  

To some extent, that depends on the language.

The parser must be able to complete a parse, to generate a complete
syntax tree. I'll assume no error correction for a moment; more below.

In C or C++ body files, "a complete parse" is typically one variable or
function declaration. So if Emacs can reliably find the beginning and
end of those declarations, it could pass just the ones containing the
region of interest to the parser. Tree-sitter (if it supports this at
all, or is modified to) would end up with a forest of small parse trees,
rather than one large one. They might get merged if large chunks of text
are parsed together.

In Ada and Java, and most C++ header files, "a complete parse" is a
file; it contains an Ada package spec or body, or a Java or C++ class,
or a C++ namespace.

There are many "small languages" for which "a complete parse" is similar
to a "statement". Bash shell, for example. They could pass just the
statement, but only if Emacs can reliably find the start and end (not
always easy).

It is also possible to modify the language grammar to allow smaller
pieces of code to be a complete parse; ada-mode does this, making a
single declaration or statement "a complete parse", in order to support
"partial parse". That can easily lead to errors in indent, since the
indent of the start of the text portion is unknown (ada-mode simply
assumes it is correct in the buffer).

Another reason to allow smaller code chunks to be a complete parse is to
allow parsing the code fragments that appear in grammar actions; the
ELPA package wisitoken-grammar-mode uses this for wisitoken grammar
files with Ada actions.

In sum, the short answer is "yes, you must parse the whole file, unless
your language is particularly simple".

Since we need to support the worst case, we should assume the whole file
must be parsed at least once.

> Can we pass to the parser just a small chunk (say, 500 bytes) of the
> buffer around the window-full to be displayed next? If this presents
> problems, what are those problems?

In wisi, the error correction code will fill in the missing text so a
complete parse is possible. Since some of that is guesses, the results
may not be very good. Tree-sitter also has error correction; I'm not
clear how good it is.

> IOW, the issue with exposing access to buffer text to modules is IMO
> secondary.  

yes, because copying text is fast compared to everything else going on.

> My suggestion is first to figure out how to do this stuff efficiently
> from within Emacs itself, as if the module interface were not part of
> the equation. We can add that aspect back later.

There are two times the wisi code that wraps the parser needs access to
the buffer; first to copy the text, second to add text properties
(faces, indent values, navigation markers). There are usually many text
properties output by each parse.

The positions and values of the text properties are computed by
functions that run after the complete syntax tree has been produced. In
wisi, those functions are added directly in the grammar source file
(where they are called "post-parse grammar actions"). In tree-sitter, I
assume they are called from some mode-author-written code that traverses
the syntax tree (wisi provides that internally). Except I see below that
the emacs tree-sitter package stores the syntax tree in the buffer.

One option here is to try to standardize on an elisp representation of a
syntax tree, and have both the wisi and tree-sitter parsers provide
that. Then the grammar actions could be implemented in elisp. I suspect
that would be very slow; elisp is just not good at traversing large
complex data structures. That is not just my bias showing (I _much_
prefer doing as much as possible in Ada); I first wrote the ada-mode
parser and grammar actions in elisp, and then did a complete rewrite in
Ada, gaining significant speed. Although I never considered passing the
syntax tree to elisp as a single object, so maybe that could work well.

There is no universal standard for representing "a syntax tree". In
wisi, the tree is directly produced by the LR shift and reduce
operations, and thus is very close the the grammar expressed in BNF. I
don't know what the tree-sitter parse tree looks like. AdaCore provides
a parser similar in purpose to the wisi parsers
(https://github.com/AdaCore/libadalang), that also does more of what an
Ada compiler does (which could allow even better font-lock and navigation).
To support those additional operations, the syntax tree is quite
different from the ada-mode one.

In general, each parser library, and even each grammar author, will have
different representations for the syntax tree.

So if we want to support different parsers, I think it is best to define
the Emacs "parser API" as "give text to parser; accept text properties
from parser".

LSP (via eglot) provides other things the parser can return; code
completion menus, for example. And for indent and face, it returns
formatted text with markdown. I plan to translate that to text
properties to integrate LSP into wisi. Whether LSP requires a full
initial parse is up to the LSP server author (LSP itself provides both
"here's the full text" and "here's partial text" messages); they have
the same considerations discussed above.

> And yes, doing this by consing strings is not a good idea, it will
> slow things down and cause a lot of GC.  It is best avoided.  Thus my
> questions above.

I'm not sure how "convert syntax tree to elisp" compares to "consing
strings". I would certainly expect it to cause a lot of GC.

>> > Btw, what do you do with the tree returned by the tree-sitter parser?
>> > store it in some buffer-local variable?  If so, how much memory does
>> > such a tree take, and when, if ever, is that memory released?
>> >
>> 
>> It's stored in a buffer-local variable. I haven't measured the memory
>> they take. Memory is released when the tree object is garbage-collected
>> (it's a `user-ptr').

Is it an elisp structure (or accesible from elisp)? Have you written
code that traverses it to provide faces and indentation?

-- 
-- Stephe

PS; I have the beginnings of a migraine while typing this, so some of it
may not make sense. Sigh.