From: Andrew De Angelis
Newsgroups: gmane.emacs.devel
Subject: Re: treesit: how to get it to parse multiple languages
Date: Sun, 10 Nov 2024 09:35:40 -0500
To: Yuan Fu
Cc: Eli Zaretskii, emacs-devel@gnu.org

Thanks so much!

I took a look at the Emacs 30 manual and it's a lot clearer, it's perfect!
I think one thing that would truly be ideal is a major mode out there that already implements multiple-language functionality using tree-sitter. Seeing all the components in action would be quite helpful: the simple HTML examples are very clarifying, but they can only do so much.

Do you all know if such a mode exists? `(ripgrep-regexp "local-parser" source-directory)` on the master branch only shows me matches in `treesit.el` itself (and the associated ChangeLog / manual).
If it doesn't exist yet, I'm happy to give it a crack when implementing the notebook mode. I might run into some more questions then :)

On Tue, Nov 5, 2024 at 1:47 AM Yuan Fu <casouri@gmail.com> wrote:


> On Nov 4, 2024, at 4:02 AM, Eli Zaretskii <eliz@gnu.org> wrote:
>
>> From: Andrew De Angelis <bobodeangelis@gmail.com>
>> Date: Sun, 3 Nov 2024 13:28:57 -0500
>>
>> I'm trying to get a better understanding of treesit.el, and I've stumbled on a couple of things that make me
>> think the manual is either outdated/faulty, or just not entirely clear and I'm missing something.
>>
>> The latter is most likely, but I'd appreciate any help in figuring out what exactly is wrong in my
>> approach/setup. I would be happy to contribute to the manual, if needed, to ensure it is clearer.
>>
>> This is the relevant section of the manual:
>> https://www.gnu.org/software/emacs/manual/html_node/elisp/Multiple-Languages.html
>> I've started out with simply trying to recreate the setup described in the manual, but I've run into some
>> issues.
>> Here's what I've done so far:
>> - I've defined a very simple `html-ts-mode`, using the elisp functions from the manual:
>> https://github.com/andrewdea/poc-html-ts-mode/blob/main/html-ts-mode.el
>> - I activate this mode when visiting the example.html file (which is also copied from the manual):
>> https://github.com/andrewdea/poc-html-ts-mode/blob/main/example.html
>> - the queries seem to be working as expected: when I'm in a buffer visiting example.html, evaluating
>> `(treesit-query-capture 'html css-query)` and `(treesit-query-capture 'html js-query)` return the expected
>> nodes
>> - ISSUE: `treesit-update-ranges` doesn't seem to be working as expected: even if I call it multiple times, the
>> parser for the whole buffer seems to still be 'html. `(treesit-language-at (point))` always returns 'html, even
>> when I'm inside the nodes captured by the css-query or js-query.
>>
>> Some additional context: the reason I'm looking into tree-sitter (and its functionalities to support multiple
>> languages) is to potentially use it to fontify markdown code blocks and to improve emacs support for python
>> notebooks. For markdown, I was trying a similar approach to the HTML one described in the manual, but ran
>> into other similar issues:
>> https://www.reddit.com/r/emacs/comments/1gcrv8k/syntaxhighlighting_codeblocks_in_markdown/.
>> I'm just including this as context.
>>
>> Let me know if any of this is not clear.
>>
>> Thanks in advance for all your help!
>
> Yuan, can you help Andrew?

Ah yes, thanks for the ping. Andrew, I take it that your problem is with treesit-language-at, right? Specifically, it doesn’t return the expected results. That’s because for treesit-language-at to work, the major mode needs to define treesit-language-at-point-function.

This confusion has come up a couple of times now; evidently treesit-language-at is not very intuitive. Hopefully it’ll be fixed by our updated manual for Emacs 30. In Emacs 30, we define treesit-language-at-point-function in the example code:

   Emacs automates this process in ‘treesit-update-ranges’.  A
multi-language major mode should set ‘treesit-range-settings’ so that
‘treesit-update-ranges’ knows how to perform this process automatically.
Major modes should use the helper function ‘treesit-range-rules’ to
generate a value that can be assigned to ‘treesit-range-settings’.  The
settings in the following example directly translate into operations
shown above.

     (setq treesit-range-settings
           (treesit-range-rules
            :embed 'javascript
            :host 'html
            '((script_element (raw_text) @capture))
            :embed 'css
            :host 'html
            '((style_element (raw_text) @capture))))
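The "operations shown above" are not reproduced in this message. As a rough sketch of what they amount to, along the lines of the same manual section, setting the ranges by hand could look like the following. The helper name `poc-set-embedded-ranges` is hypothetical, and the html, css, and javascript grammars are assumed to be installed.

```elisp
(require 'treesit)

(defun poc-set-embedded-ranges ()
  "Manually confine the CSS and JavaScript parsers to their HTML regions.
Hypothetical helper: roughly the work that `treesit-update-ranges'
automates once `treesit-range-settings' is set."
  ;; The host parser has to exist before the host tree can be queried.
  (treesit-parser-create 'html)
  (let ((css-parser (treesit-parser-create 'css))
        (js-parser (treesit-parser-create 'javascript))
        ;; Each call returns a list of (BEG . END) buffer ranges.
        (css-ranges (treesit-query-range
                     'html '((style_element (raw_text) @capture))))
        (js-ranges (treesit-query-range
                    'html '((script_element (raw_text) @capture)))))
    ;; Caveat: a nil range list makes the parser cover the whole
    ;; buffer.  This sketch does not guard against that case.
    (treesit-parser-set-included-ranges css-parser css-ranges)
    (treesit-parser-set-included-ranges js-parser js-ranges)))
```

Doing this by hand means repeating it after every edit that moves a `<style>` or `<script>` block, which is the bookkeeping the range rules are meant to take over.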

     ;; Major modes with multiple languages should always set
     ;; `treesit-language-at-point-function' (which see).
     (setq treesit-language-at-point-function
           (lambda (pos)
             (let* ((node (treesit-node-at pos 'html))
                    (parent (treesit-node-parent node)))
               (cond
                ((and node parent
                      (equal (treesit-node-type node) "raw_text")
                      (equal (treesit-node-type parent) "script_element"))
                 'javascript)
                ((and node parent
                      (equal (treesit-node-type node) "raw_text")
                      (equal (treesit-node-type parent) "style_element"))
                 'css)
                (t 'html)))))
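To see the two settings above working together, a minimal major-mode sketch might look like this. It is an illustration under assumptions, not code from the thread or from Emacs itself: the names `poc-html-ts-mode` and `poc-html-ts--language-at` are hypothetical, and the html, css, and javascript grammars are assumed to be installed.

```elisp
(require 'treesit)

(defun poc-html-ts--language-at (pos)
  "Return the language at POS: `javascript', `css', or `html'.
Hypothetical helper with the same logic as the manual's lambda."
  (let* ((node (treesit-node-at pos 'html))
         (parent (treesit-node-parent node)))
    (if (and node parent
             (equal (treesit-node-type node) "raw_text"))
        (pcase (treesit-node-type parent)
          ("script_element" 'javascript)
          ("style_element" 'css)
          (_ 'html))
      'html)))

(define-derived-mode poc-html-ts-mode prog-mode "HTML[ts]"
  "Hypothetical minimal mode for HTML with embedded CSS and JavaScript."
  (when (and (treesit-ready-p 'html)
             (treesit-ready-p 'css)
             (treesit-ready-p 'javascript))
    ;; One parser per language; the embedded parsers get their
    ;; ranges from `treesit-range-settings'.
    (treesit-parser-create 'html)
    (treesit-parser-create 'css)
    (treesit-parser-create 'javascript)
    (setq-local treesit-range-settings
                (treesit-range-rules
                 :embed 'javascript
                 :host 'html
                 '((script_element (raw_text) @capture))
                 :embed 'css
                 :host 'html
                 '((style_element (raw_text) @capture))))
    ;; `treesit-language-at' consults this function, not the ranges.
    (setq-local treesit-language-at-point-function
                #'poc-html-ts--language-at)
    (treesit-major-mode-setup)))
```

With this mode active in example.html, evaluating `(treesit-update-ranges)` and then `(treesit-parser-included-ranges (treesit-parser-create 'css))` should list the CSS ranges, and `(treesit-language-at (point))` inside a `<style>` block should return `css`. The two checks are independent: the first exercises the range rules, the second only the language-at function.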
And FYI, in Emacs 30 we added local parsers, which might make implementing code/markdown blocks in a notebook easier.
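As a rough idea of what that could look like for Markdown code blocks: in Emacs 30, `treesit-range-rules` accepts a `:local t` keyword, which gives each captured range its own parser instead of sharing one parser across all ranges. The sketch below is an assumption-laden illustration, not something from this thread. The function name is hypothetical, the node names `fenced_code_block` and `code_fence_content` are assumed from the tree-sitter-markdown grammar, and every block is treated as Python for simplicity.

```elisp
(require 'treesit)

(defun poc-notebook-setup-ranges ()
  "Hypothetical setup: parse every Markdown code block as Python.
Each block gets its own local parser (requires Emacs 30 or later)."
  ;; The host parser covers the whole buffer.
  (treesit-parser-create 'markdown)
  (setq-local treesit-range-settings
              (treesit-range-rules
               :embed 'python
               :host 'markdown
               :local t
               ;; Node names assumed from tree-sitter-markdown.
               '((fenced_code_block (code_fence_content) @capture))))
  ;; Create or refresh the local parsers for the current buffer.
  (treesit-update-ranges))
```

After calling this in a buffer with the markdown and python grammars available, `(treesit-local-parsers-at (point))` inside a code block should return that block's parser. A real mode would also set `treesit-language-at-point-function` and would pick the embedded language from each block's info string rather than hard-coding Python.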

Yuan