From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.io!.POSTED.ciao.gmane.io!not-for-mail
From: Stephen Leake <stephen_leake@stephe-leake.org>
Newsgroups: gmane.emacs.devel
Subject: Re: emacs-tree-sitter and Emacs
Date: Thu, 02 Apr 2020 17:55:36 -0800
Message-ID: <86a73tgwo7.fsf@stephe-leake.org>
References: <CAAxewyhxQek3K7yjSPXm3-7VLxLu92HpZOJKGKcL2eFbePg3dQ@mail.gmail.com>
 <83eeta3sa0.fsf@gnu.org> <86369ojbig.fsf@stephe-leake.org>
 <83lfnfz6jr.fsf@gnu.org> <864ku3htmb.fsf@stephe-leake.org>
 <83v9mix9vk.fsf@gnu.org>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: ciao.gmane.io; posting-host="ciao.gmane.io:159.69.161.202";
	logging-data="111074"; mail-complaints-to="usenet@ciao.gmane.io"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.2 (windows-nt)
To: emacs-devel <emacs-devel@gnu.org>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Fri Apr 03 03:56:40 2020
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane-mx.org
Original-Received: from lists.gnu.org ([209.51.188.17])
	by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
	(Exim 4.92)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org>)
	id 1jKBZY-000Sni-Do
	for ged-emacs-devel@m.gmane-mx.org; Fri, 03 Apr 2020 03:56:40 +0200
Original-Received: from localhost ([::1]:49266 helo=lists1p.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.90_1)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org>)
	id 1jKBZX-0000uR-EL
	for ged-emacs-devel@m.gmane-mx.org; Thu, 02 Apr 2020 21:56:39 -0400
Original-Received: from eggs.gnu.org ([2001:470:142:3::10]:58680)
 by lists.gnu.org with esmtp (Exim 4.90_1)
 (envelope-from <stephen_leake@stephe-leake.org>) id 1jKBYx-0000Le-PW
 for emacs-devel@gnu.org; Thu, 02 Apr 2020 21:56:05 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <stephen_leake@stephe-leake.org>) id 1jKBYv-0001lt-6o
 for emacs-devel@gnu.org; Thu, 02 Apr 2020 21:56:02 -0400
Original-Received: from gateway24.websitewelcome.com ([192.185.50.93]:21882)
 by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32)
 (Exim 4.71) (envelope-from <stephen_leake@stephe-leake.org>)
 id 1jKBYu-0001VZ-SR
 for emacs-devel@gnu.org; Thu, 02 Apr 2020 21:56:01 -0400
Original-Received: from cm11.websitewelcome.com (cm11.websitewelcome.com [100.42.49.5])
 by gateway24.websitewelcome.com (Postfix) with ESMTP id 9C0F63540C
 for <emacs-devel@gnu.org>; Thu,  2 Apr 2020 20:55:41 -0500 (CDT)
Original-Received: from host2007.hostmonster.com ([67.20.76.71]) by cmsmtp with SMTP
 id KBYbjQTTfSl8qKBYbjvwLv; Thu, 02 Apr 2020 20:55:41 -0500
X-Authority-Reason: nr=8
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
 d=stephe-leake.org; s=default; h=Content-Type:MIME-Version:Message-ID:
 In-Reply-To:Date:References:Subject:To:From:Sender:Reply-To:Cc:
 Content-Transfer-Encoding:Content-ID:Content-Description:Resent-Date:
 Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id:
 List-Help:List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive;
 bh=hJHzetl+eX0RfTkozLBirh9I9PZ+mk2KQbdEY5Cpglw=; b=dN8R94KhS3B68juBheVsyqQ1O
 DU0dTTB3DA32CG7UY+SoOg52HGcBfExDyGewm+ejVr+iPcfxbGEBRTQbt+2H/aS+ZW4DahZOuYTaG
 yIsz1MBtRLc6+xAdqle2U7hEHBaAYkSrKxKGUbqYxRZaxQIL79U9sCmgcHYN2573Q+1ZcZl+/7rie
 t1GCCV75dT7hXsC8PabUdHQAQeM37zSGCafJLcCafIT6elycnIbagFHfvXoumYYv0h0UvSZZnb51v
 wfCuf8H9fLNksfmXj27a2P9GjQRPsn0u7+SfsS90oJahskNeoDucufTZyUBZodF4gmsPTNsWfPOnT
 uLr0LUZ5Q==;
Original-Received: from [76.77.182.20] (port=63745 helo=Takver4)
 by host2007.hostmonster.com with esmtpsa
 (TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384:256) (Exim 4.92)
 (envelope-from <stephen_leake@stephe-leake.org>) id 1jKBYa-002jDh-VH
 for emacs-devel@gnu.org; Thu, 02 Apr 2020 19:55:41 -0600
In-Reply-To: <83v9mix9vk.fsf@gnu.org> (Eli Zaretskii's message of "Thu, 02 Apr
 2020 17:03:43 +0300")
X-AntiAbuse: This header was added to track abuse,
 please include it with any abuse report
X-AntiAbuse: Primary Hostname - host2007.hostmonster.com
X-AntiAbuse: Original Domain - gnu.org
X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12]
X-AntiAbuse: Sender Address Domain - stephe-leake.org
X-BWhitelist: no
X-Source-IP: 76.77.182.20
X-Source-L: No
X-Exim-ID: 1jKBYa-002jDh-VH
X-Source-Sender: (Takver4) [76.77.182.20]:63745
X-Source-Auth: stephen_leake@stephe-leake.org
X-Email-Count: 1
X-Source-Cap: c3RlcGhlbGU7c3RlcGhlbGU7aG9zdDIwMDcuaG9zdG1vbnN0ZXIuY29t
X-Local-Domain: yes
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic]
X-Received-From: 192.185.50.93
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
 <mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <https://lists.gnu.org/archive/html/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
 <mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org
Original-Sender: "Emacs-devel"
 <emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org>
Xref: news.gmane.io gmane.emacs.devel:246294
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/246294>

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Stephen Leake <stephen_leake@stephe-leake.org>
>> Date: Wed, 01 Apr 2020 11:51:40 -0800
>> 
>> Eli Zaretskii <eliz@gnu.org> writes:
>> 
>> > Can you tell in more detail why you need to rely on these hooks?  They
>> > shouldn't be necessary, AFAIU.  
>> 
>> It is an optimization choice.
>> 
>> In an unmodified buffer, that is smaller than 100,000 characters
>> (default setting of wisi-partial-parse-threshold), the entire buffer is
>> parsed once; that applies faces to all the Ada identifiers that need
>> faces (standard font-lock regexp handles the reserved words). Then when
>> font-lock fontifies a region, no parsing is needed.
>
> But why do you need that initial full parse in the first place?  Is
> parsing parts of the buffer so much harder?

Because the parser must see a complete top level grammar statement. In
Ada, that's the whole file; a typical file looks like:

package Nifty is

    type Foo is ...;

    function Function_1 is ...;

end Nifty;

The parser needs to see all of the "package" declaration. Java and C++
header files are similar; a single class or namespace. In C++ and C body
files, there are lots of small declarations, and you could parse each
one of those independently, but _only_ if Emacs can find the start and
end of each, which is hard.

In addition, to properly compute indent, you need the fully nested
context. Computing faces usually doesn't need that, but it might in some
cases.

>> Indent is similar; the parse sets text properties holding the indent for
>> each line; indent-region then applies them.
>
> Indent is a different use case: it happens by user command, and thus
> has different time restrictions than redisplay.

Yes, but it is computed by the same parser, so it is relevant.

>> If the default setting of jit-lock-defer-time (ie nil) is used, then
>> font-lock runs immediately after each change, and the after-change hooks
>> are not needed. But as I have mentioned, I always run with
>> jit-lock-defer-time set to 1.0 (because parsing is not fast enough in
>> some cases), so the change hooks are needed.
>
> AFAIU, tree-sitter and similar parsers are supposed to be much faster,
> so the problem with slow parsing, and all the solutions to alleviate
> that problem, may not be necessary, if they are the only reason for
> using the hooks.

The main reason the ada-mode parser is too slow is the error correction.
tree-sitter appears to have less sophisticated error correction, which
will give worse results with code under edit. The ada-mode parser can be
speeded up by specifying parameters that cripple the error correction.

In addition, users will always create huge files (where "huge" means
"bigger than we've seen before"); there are always speed limits. The
reason ada-mode has partial parse is that Eurocontrol has huge files,
that they occasionally edit, and always parsing the whole file, even in
the absence of syntax errors, was too slow.

>> The alternative to not requiring after-change hooks is to always do a full
>> parse, for ever call of fontify-region or indent-region. That is far too
>> slow.
>
> Even for indentation, a full parse should not be needed.  You need to
> only parse the outermost enclosing function/procedure, right?  That's
> rarely the full buffer, except when the buffer is small.

As discussed above, that depends on your language; in Ada it is _always_
the full buffer. And finding the start of a function in C and C++ is hard.

>> Note that Tree-Sitter requires one full parse of the buffer to generate
>> the parse tree that is later updated incrementally; in an unmodified
>> buffer, only that one parse is needed.
>
> Tree-sitter cannot know what the full buffer holds, so nothing
> prevents us from passing it just part of the buffer.  After all,
> tree-sitter should be able to do a decent job when the part we pass to
> it actually _is_ all we have in the buffer, right?

Same issues as above.

>> > And they cannot pick up every relevant change; for example, what
>> > happens if some face used for font-lock is modified?
>> 
>> Yes, that is a flaw. Not likely to occur in everyday use
>
> Redisplay cannot rely on something being "unlikely", because it's
> expected to produce correct results in all situations.  

The flaw is not in ada-mode's use of a parser or after-change-functions;
it's a general problem with font-lock.

The face values are applied to the buffer text as text properties
containing the symbol that holds the face to be used; for example
(font-lock-face font-lock-function-name-face). If the contents of that
symbol change, then redisplay must be rerun to apply the correct values.
This does _not_ require a reparse; the parser sets the text property,
and that has not changed.

Use case: A c-mode buffer A is currently displayed in a window in a
frame, it is syntactically correct, and all displayed faces are correct.
In another frame, the user uses 'M-x set-variable' to change the value
of font-lock-function-name-face.

To update the display, something has to trigger redisplay of buffer A. I
don't think using M-x set-variable in a different frame does that.

Switching buffers in a frame does cause a redisplay (to update the menu
and mode line); If M-x set-variable is done in the same frame as buffer
A, the change in font-lock-function-name-face should show up as
expected.

A similar use case would be changing from "light mode" to "dark mode".
That could be done by changing the theme using load-theme; that should
force a redisplay (I assume it does; I have not checked). 

Other than the global face variables, ada-mode does not have any
variables that control faces. Some other modes may, for example setting
the level of highlighting to minimal or max. In that case, the font-lock
regexps change, and the function that does that presumably sets
fontified to nil in the current buffer, and should also force redisplay.
If ada-mode adds a feature like this, there will be a function to change
it (perhaps a custom variable change function) that also forces a
reparse and redisplay.

> I can understand why fontification methods that are too slow want to
> get some help from hooks, but when we design and implement novel
> fontification methods using fast parsers, we should first try doing
> that without any hooks,

Yes, premature optimization is evil. Using tree-sitter to implement
font-lock should start by always parsing the whole buffer for every call
of fontify-region. If that is fast enough, we're done. If not, we can
consider whether parsing a smaller part of the buffer is possible.

Note that the fact that tree-sitter provides incremental parse is a
strong hint that the answer will be "it's not fast enough".

>> >> By default font-lock runs after every character typed
>> >
>> > No, it only runs when redisplay kicks in.  If you type very quickly,
>> > it won't run for every character.  At least AFAIR.
>> 
>> What triggers redisplay?
>
> When Emacs is about to read input, if no input is available, it
> performs redisplay.  IOW, Emacs enters redisplay when it's about to
> become idle.
>
<snip>
>> The elisp manual section "Forcing redisplay" says "Emacs normally tries
>> to redisplay the screen whenever it waits for input." After I type the
>> first character, it is no longer waiting for input, it is processing
>> that character. I assume here "process that char code" includes running
>> after-change-functions, which is (small) elisp code. But I guess after
>> processing that char, before calling redisplay, it checks if there is
>> more input, which should be true if I type fast enough. Perhaps "process
>> that char code" is faster than the combination of my fingers and the
>> keyboard char send rate?
>
> Yes, most probably.

Ok, so in practice, it is not possible to type fast enough, and
font-lock runs after every character typed. 

> In other similar situations (e.g., in Flyspell mode) we wait for some
> non-zero idle time before actually running the code which could react
> to slow typing with annoying messages.

Since font-lock is running a parser, it detects syntax errors. I
could delay the display of the fringe mark, without delaying font-lock
itself. I'll put that on my list.

-- 
-- Stephe