Re: treesitter local parser: huge slowdown and memory usage in a long file

unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed

From: Vincenzo Pupillo <v.pupillo@gmail.com>
To: 付禹安 <casouri@gmail.com>, "Dmitry Gutov" <dmitry@gutov.dev>,
	emacs-devel@gnu.org
Cc: "Ergus via Emacs development discussions." <emacs-devel@gnu.org>,
	Yuan Fu <casouri@gmail.com>
Subject: Re: treesitter local parser: huge slowdown and memory usage in a long file
Date: Sat, 20 Apr 2024 21:14:14 +0200	[thread overview]
Message-ID: <6049335.lOV4Wx5bFT@fedora> (raw)
In-Reply-To: <2DB11528-C657-4AC1-A143-A13B1EAC897A@gmail.com>

Great job!
I tried your new patch with my usual benchmark, tcpdf.php, and my php-ts-mode 
and it works very well!
Thank you very much
V.

In data sabato 20 aprile 2024 04:18:53 CEST, Yuan Fu ha scritto:
> > > On Feb 18, 2024, at 9:53 PM, Yuan Fu <casouri@gmail.com> wrote:
> > >> On Feb 17, 2024, at 7:37 PM, Dmitry Gutov <dmitry@gutov.dev> wrote:
> > >> 
> > >> On 13/02/2024 10:08, Yuan Fu wrote:
> > >>>> On 12/02/2024 06:16, Yuan Fu wrote:
> > >>>>> Thanks, the culprit is the call to treesit-update-ranges in
> > >>>>> treesit--pre-redisplay, where we don’t pass it any specific range,
> > >>>>> so it
> > >>>>> updates the range for the whole buffer. Eli, is there any way to get
> > >>>>> a
> > >>>>> rough estimate the range that redisplay is refreshing? Do you think
> > >>>>> something like this would work?
> > >>>> 
> > >>>> If we don't update the ranges outside of some interval surrounding
> > >>>> the
> > >>>> window, what does that mean for correctness?
> > >>> 
> > >>> If the place of update and the embedded code currently in view belong
> > >>> to
> > >>> the same node in the host language, then when we update ranges for the
> > >>> current window-visible range, the whole node’s range is updated. So at
> > >>> least for this node, the range is correct.
> > >>> If the place of update and the embedded code currently in view belong
> > >>> to
> > >>> different nodes in the host language, then when we update ranges for
> > >>> the
> > >>> current window-visible range, only the visible node’s range is
> > >>> updated.
> > >> 
> > >> Okay. What about positions after the visible part of the buffer? Can
> > >> their
> > >> ranges be outdated? It's probably okay when the ranges are only used
> > >> for
> > >> font-lock and syntax-ppss, but I wonder about possible other
> > >> applications
> > >> (reindenting the whole buffer, for example).
> > > 
> > > It’s the same as positions before the visible part. For reindenting the
> > > whole buffer, treesit-indent-region will update the range for the whole
> > > buffer at the very beginning.
> > > 
> > >>>> Perhaps the mode has a syntax-propertize-function which behaves
> > >>>> differently (as it should) depending on the language at point. Or
> > >>>> different ranges have different syntax tables, something like that.
> > >>>> 
> > >>>> If the ranges, after some edit (perhaps a programmatic one, performed
> > >>>> far
> > >>>> from the visible area), are kept not update somewhere around the
> > >>>> beginning
> > >>>> of the buffer, do we not risk confusing the syntax-ppss parser, for
> > >>>> example?
> > >>> 
> > >>> That can happen, yes.
> > >>> 
> > >>>> Come to think of it, take treesit-indent: it only updates the ranges
> > >>>> for
> > >>>> the current line. But the line's indentation usually depends on the
> > >>>> previous buffer positions, doesn't it?
> > >>> 
> > >>> The range passed to treesit-update-ranges act as an intercepting
> > >>> range—we
> > >>> capture nodes that intercepts with the range and use them to update
> > >>> ranges.
> > >>> If the line to be indented is in an embedded language block, the whole
> > >>> block will be captured and it’s range will be given to the embedded
> > >>> language parser.
> > >>> We haven’t have any problem so far mainly because most embedded code
> > >>> blocks
> > >>> are local, and it’s rare for some edit to take place far from the
> > >>> visible
> > >>> portion which affects ranges and user expects that edit to affect the
> > >>> current visible range.
> > >>> I don’t have any great idea for a better way to update ranges right
> > >>> now.
> > >>> Let me think about that. In the meantime, I’ll push a temporary fix so
> > >>> V’s
> > >>> original problem can be solved.
> > >> 
> > >> I was thinking (since considering the same problem in mmm-mode,
> > >> actually)
> > >> that it would make sense to either plug into
> > >> syntax-propertize-function, or
> > >> have a parallel data structure similarly tracking the outdated buffer
> > >> regions, which would only update the part of the buffer which had been
> > >> modified since last time.
> > >> 
> > >> Dealing with the "remainder" of the buffer might be trickier, but maybe
> > >> some heuristic which would help detect the "no changes" case could be
> > >> implemented.> > 
> > > Yeah, something similar to syntax-ppss or jit-lock. Or maybe it can be
> > > avoided, since the current on-demand range update has been working fine,
> > > until we added treesit--pre-redisplay for syntax-ppss.
> > 
> > This is actually a bit involved, because there could be multiple layer’s
> > of
> > parsers: the host language sets range for a local parser, and the local
> > parser can set ranges for a nested-nested parser. Eg, we might have a
> > markdown parser for parsing doc-comments, and inside the markdown there
> > could be code blocks which require another level of nested parser.
> > 
> > This use-case is a bit advanced but we definitely need to support it in
> > our
> > design. And my brain is twisted by all the dependency and range. If you
> > guys has some ideas they’ll be most welcome :-)
> 
> I believe I’ve found a good way to solve this problem. I pushed the changes
> to master.
> 
> Basically I added a function treesit-parser-changed-ranges that can directly
> return the change ranges from last reparse. This means we don’t need to use
> notifiers to get those change ranges anymore. Then in
> treesit-pre-redisplay, we reparse the primary parser and get the changed
> ranges from it.
> 
> Once we have the changed ranges, we update other non-primary parser’s
> ranges, but only within the changed ranges. Originally we were updating
> those parser’s ranges on the whole buffer, which led to the slowdown. Then
> we had to use some workaround to solve this. Now the workaround isn’t
> needed anymore.
> 
> I also remove some notifier functions and moved their work into
> treesit-pre-redisplay.
> 
> Yuan

next prev parent reply	other threads:[~2024-04-20 19:14 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-04-20  2:18 treesitter local parser: huge slowdown and memory usage in a long file Yuan Fu
2024-04-20 19:14 ` Vincenzo Pupillo [this message]
2024-04-23  5:09   ` Yuan Fu
2024-05-06  2:04 ` Dmitry Gutov
2024-05-09  0:16   ` Yuan Fu
2024-05-12 23:44     ` Dmitry Gutov
  -- strict thread matches above, loose matches on Subject: below --
2024-02-11 21:53 Vincenzo Pupillo
2024-02-12  4:16 ` Yuan Fu
2024-02-12 14:09   ` Eli Zaretskii
2024-02-13  8:15     ` Yuan Fu
2024-02-13  9:39       ` Vincenzo Pupillo
2024-02-13 12:59       ` Eli Zaretskii
2024-02-13  0:50   ` Dmitry Gutov
2024-02-13  8:08     ` Yuan Fu
2024-02-18  3:37       ` Dmitry Gutov
2024-02-19  5:53         ` Yuan Fu
2024-03-21  6:39           ` Yuan Fu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/emacs/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=6049335.lOV4Wx5bFT@fedora \
    --to=v.pupillo@gmail.com \
    --cc=casouri@gmail.com \
    --cc=dmitry@gutov.dev \
    --cc=emacs-devel@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).