From: Jorge Javier Araya Navarro
Newsgroups: gmane.emacs.devel
Subject: Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
Date: Wed, 1 Apr 2020 23:19:31 -0600
References: <83369o1khx.fsf@gnu.org> <83imijz68s.fsf@gnu.org> <831rp7ypam.fsf@gnu.org>
To: Tuấn-Anh Nguyễn
Cc: Eli Zaretskii, emacs-devel@gnu.org

> Also keep in mind that syntax highlighting is just one
> application. Other use cases usually want a full parse tree.

Like indentation, or so I think 🤔. Indentation may be one of those use cases.

On Wed, Apr 1, 2020 at 22:22, Tuấn-Anh Nguyễn <ubolonton@gmail.com> wrote:
On Thu, Apr 2, 2020 at 2:33 AM Eli Zaretskii <eliz@gnu.org> wrote:
>
> > From: Tuấn-Anh Nguyễn <ubolonton@gmail.com>
> > Date: Thu, 2 Apr 2020 00:55:45 +0700
> > Cc: emacs-devel@gnu.org
> >
> > > Did you consider using the API where an application can provide a
> > > function to return text at a given offset?  Such a function could be
> > > relatively easily implemented for Emacs.
> > >
> >
> > I don't understand what you mean. Below I'll explain how it works
> > currently.  [...]  If dynamic modules have direct access to the
> > buffer text, none of the above is an issue.
> >
> > Such direct access can be enabled by something like this:
> >
> >     char* (*access_buffer_text) (emacs_env *env,
> >                                  emacs_value buffer,
> >                                  ptrdiff_t byte_offset,
> >                                  ptrdiff_t *size_inout);
> >
> > Of course, such an API would require extensive documentation on how it
> > must be used, to ensure safety and correctness.
>
> I think you are moving too fast, and keep the current implementation
> in sight too much.
>

I'm actually moving too slowly here. I have thought about this part quite
a bit, but I'm currently focusing on other things, partly because
this is not a painful bottleneck.

> What I suggest is to step back and see how such direct access, if it
> were available, could be used with tree-sitter.  Let's forget about
> modules for a moment and consider tree-sitter linked with Emacs and
> capable of calling any C function in core.  How would you use that?
>
> Buffer text is not exactly UTF-8, it's a superset of UTF-8.  So one
> question to answer is what to do with byte sequences that are not
> valid UTF-8.  Any suggestions or ideas?  How does tree-sitter handle
> invalid byte sequences in general?
>

I haven't checked yet. It will probably bail out, which is usually the
desired behavior. The tree-sitter's author is likely open to making this
behavior configurable here, though. Alternatively, the direct access
function can offer different behaviors: as-is, bail-out, skip-over, or
null-out (tree-sitter will skip over null bytes, IIRC).
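To make the four behaviors concrete, here is a minimal sketch of a filter that a direct access function could apply to each chunk before handing it to the parser. Everything in it is an assumption for illustration: the names `invalid_policy`, `utf8_seq_len` and `filter_chunk` are hypothetical, and the validity check is deliberately simplified. It is not how Emacs or tree-sitter treats these bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical policies, named after the behaviors listed above. */
enum invalid_policy
{
  POLICY_AS_IS,     /* pass the offending byte through unchanged */
  POLICY_BAIL_OUT,  /* stop and report failure */
  POLICY_SKIP_OVER, /* drop the offending byte */
  POLICY_NULL_OUT   /* replace the offending byte with a null byte */
};

/* Length of the UTF-8 sequence starting at P, or 0 if P does not start
   one.  Simplified: lead byte and continuation bytes are checked, but
   overlong 3/4-byte forms and surrogates are not rejected. */
static size_t
utf8_seq_len (const unsigned char *p, size_t n)
{
  if (n == 0)
    return 0;
  unsigned char c = p[0];
  size_t len;
  if (c < 0x80)
    return 1;
  else if (c >= 0xC2 && c <= 0xDF)
    len = 2;
  else if (c >= 0xE0 && c <= 0xEF)
    len = 3;
  else if (c >= 0xF0 && c <= 0xF4)
    len = 4;
  else
    return 0;
  if (n < len)
    return 0;
  for (size_t i = 1; i < len; i++)
    if ((p[i] & 0xC0) != 0x80)
      return 0;
  return len;
}

/* Copy N bytes from IN to OUT (which must hold at least N bytes),
   applying POLICY to every byte that does not start a valid sequence.
   Returns the number of bytes written, or -1 under POLICY_BAIL_OUT. */
static ptrdiff_t
filter_chunk (const unsigned char *in, size_t n, unsigned char *out,
              enum invalid_policy policy)
{
  assert (in != NULL && out != NULL);
  size_t i = 0, o = 0;
  while (i < n)
    {
      size_t len = utf8_seq_len (in + i, n - i);
      if (len > 0)
        {
          memcpy (out + o, in + i, len);
          i += len;
          o += len;
          continue;
        }
      switch (policy)
        {
        case POLICY_BAIL_OUT:
          return -1;
        case POLICY_AS_IS:
          out[o++] = in[i];
          break;
        case POLICY_NULL_OUT:
          out[o++] = 0;
          break;
        case POLICY_SKIP_OVER:
          break;
        }
      i++;
    }
  return (ptrdiff_t) o;
}
```

One thing the sketch makes visible: skip-over shortens the output, so byte offsets seen by the parser no longer match buffer offsets, while as-is and null-out keep them aligned.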

> Also, direct access to buffer text generally means we must make sure
> GC never runs as long as pointers to buffer text are lying around.
> Can any Lisp run between calls to the reader function that the
> tree-sitter parser calls to access the buffer text?  If so, we need to
> take care of that issue.
>

With direct access, no Lisp code will be run between these calls.
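For readers who want to picture the reader function under discussion, here is a rough sketch of chunked access over a gap buffer, loosely modeled on the access_buffer_text signature quoted earlier. The `gap_buffer` struct and the `access_text` and `copy_all` functions are hypothetical stand-ins, not Emacs internals and not the module API. The point is only that each call returns a pointer into the existing storage, so nothing is copied or consed and no Lisp needs to run between calls.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical gap buffer: text lives in `data`, with a hole of
   `gap_size` bytes starting at logical offset `gap_start`. */
struct gap_buffer
{
  char *data;
  ptrdiff_t gap_start; /* logical offset where the gap begins */
  ptrdiff_t gap_size;  /* bytes occupied by the gap */
  ptrdiff_t text_size; /* logical text length, gap excluded */
};

/* Return a pointer to the contiguous run of text at logical
   BYTE_OFFSET.  On entry *SIZE_INOUT is the number of bytes wanted; on
   exit it is the number actually available (0 at end of text). */
static const char *
access_text (const struct gap_buffer *b, ptrdiff_t byte_offset,
             ptrdiff_t *size_inout)
{
  assert (byte_offset >= 0 && *size_inout >= 0);
  if (byte_offset >= b->text_size)
    {
      *size_inout = 0;
      return NULL;
    }
  /* A run never crosses the gap: it ends at the gap or at end of text. */
  ptrdiff_t run_end
    = byte_offset < b->gap_start ? b->gap_start : b->text_size;
  ptrdiff_t avail = run_end - byte_offset;
  if (*size_inout > avail)
    *size_inout = avail;
  ptrdiff_t physical
    = byte_offset < b->gap_start ? byte_offset : byte_offset + b->gap_size;
  return b->data + physical;
}

/* Consume the text chunk by chunk, the way a parser's read callback
   would.  Copies at most OUT_SIZE bytes; returns the number copied. */
static ptrdiff_t
copy_all (const struct gap_buffer *b, char *out, ptrdiff_t out_size)
{
  ptrdiff_t pos = 0;
  while (pos < out_size)
    {
      ptrdiff_t n = out_size - pos;
      const char *chunk = access_text (b, pos, &n);
      if (n == 0)
        break;
      memcpy (out + pos, chunk, (size_t) n);
      pos += n;
    }
  return pos;
}
```

The returned pointers are only valid while the buffer is neither modified nor relocated, which is exactly why the GC question above matters.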

> Next, I'm still asking whether parsing the whole buffer when it is
> first created is necessary.  Can we pass to the parser just a small
> chunk (say, 500 bytes) of the buffer around the window-full to be
> displayed next?  If this presents problems, what are those problems?
>

In principle (not in tree-sitter ATM), and in very specific cases, yes.
IMO that's the wrong focus on a premature optimization anyway. As others
noted, even in the pathological case of xdisp.c, the performance is
acceptable. Also keep in mind that syntax highlighting is just one
application. Other use cases usually want a full parse tree.

If we really want to tackle this issue, there are other approaches to
consider, e.g. background parsing, or parsing up until a time limit and
resuming when Emacs is idle. Tree-sitter's API supports the
latter.
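The "parse until a limit, resume when idle" idea can be sketched as a resumable job that keeps its own position between calls. This is an illustration only and makes two assumptions: a byte budget stands in for the wall-clock limit so the example stays deterministic, and counting newlines stands in for real parsing. The names `scan_job`, `scan_slice` and `run_in_slices` are hypothetical, and tree-sitter's own mechanism is not used here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical resumable job; counting newlines stands in for parsing. */
struct scan_job
{
  const char *text;
  size_t size;
  size_t pos;      /* resume point, preserved between slices */
  size_t newlines; /* result accumulated so far */
};

/* Process at most BUDGET bytes, then yield.  Returns true once the
   whole text has been consumed. */
static bool
scan_slice (struct scan_job *job, size_t budget)
{
  assert (budget > 0);
  size_t stop
    = job->size - job->pos < budget ? job->size : job->pos + budget;
  for (; job->pos < stop; job->pos++)
    if (job->text[job->pos] == '\n')
      job->newlines++;
  return job->pos == job->size;
}

/* Drive the job to completion and return the number of slices used.
   In an editor, each slice would instead be scheduled on idle time. */
static size_t
run_in_slices (struct scan_job *job, size_t budget)
{
  size_t slices = 1;
  while (!scan_slice (job, budget))
    slices++;
  return slices;
}
```

With a real incremental parser, an edit arriving between slices would also have to invalidate or adjust the saved position, which this sketch leaves out.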

But again, both thought exercises and my usage so far point to this
being a non-issue.

> IOW, the issue with exposing access to buffer text to modules is IMO
> secondary.  My suggestion is first to figure out how to do this stuff
> efficiently from within Emacs itself, as if the module interface were
> not part of the equation.  We can add that aspect back later.
>

My opinion is that it's better to experiment with this kind of stuff out-of-core. It can move forward faster that way, allowing more lessons
to be learned. Real lessons, involving real-world use cases, not thought exercises.

In a somewhat similar vein, writing emacs-tree-sitter highlighted real
issues with dynamic modules, which I'm going to write up sometime.

> And yes, doing this by consing strings is not a good idea, it will
> slow things down and cause a lot of GC.  It is best avoided.  Thus my
> questions above.
>
> > > Btw, what do you do with the tree returned by the tree-sitter parser?
> > > store it in some buffer-local variable?  If so, how much memory does
> > > such a tree take, and when, if ever, is that memory released?
> > >
> >
> > It's stored in a buffer-local variable. I haven't measured the memory
> > they take. Memory is released when the tree object is garbage-collected
> > (it's a `user-ptr').
>
> So if I have many hundreds of buffers, I could have such a tree in
> each one of them indefinitely?  Perhaps that's one more design issue
> to consider, given that the parsing is so fast.  Similar to what we do
> with image and face caches -- we flush them from time to time, to keep
> the memory footprint in check.  So a buffer that was not current more
> than some time interval ago could have its tree GCed.
>

That can work. Alternatively, tree-sitter can add support for "folding"
subtrees, as Stefan suggested.
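
The time-based flushing Eli describes could look roughly like the sketch below. It assumes a flat table with one slot per buffer, recording the tree and the last time the buffer was current. The names `tree_slot` and `flush_idle_trees` are hypothetical, and plain `free` stands in for whatever would actually release a tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical per-buffer record of a cached parse tree. */
struct tree_slot
{
  void *tree;          /* opaque parse tree, NULL if already flushed */
  time_t last_current; /* last time the owning buffer was current */
};

/* Release the tree of every slot whose buffer has not been current for
   more than MAX_IDLE_SECONDS.  Returns the number of trees released. */
static size_t
flush_idle_trees (struct tree_slot *slots, size_t n, time_t now,
                  double max_idle_seconds)
{
  assert (slots != NULL || n == 0);
  size_t freed = 0;
  for (size_t i = 0; i < n; i++)
    if (slots[i].tree != NULL
        && difftime (now, slots[i].last_current) > max_idle_seconds)
      {
        free (slots[i].tree);
        slots[i].tree = NULL;
        freed++;
      }
  return freed;
}
```

Since parsing is reported to be fast, a flushed tree can simply be rebuilt the next time its buffer becomes current.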

--
Tuấn-Anh Nguyễn
Software Engineer
