From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail From: Lynn Winebarger Newsgroups: gmane.emacs.devel Subject: Re: [SPAM UNSURE] Re: Tree-sitter introduction documentation Date: Thu, 29 Dec 2022 17:37:45 -0500 Message-ID: References: <5e0a3185-de82-b339-0fa2-956779e63d6f@cornell.edu> <868rj6vfep.fsf@gmail.com> <4895891b-e5ea-9c37-f51b-df2e479ee758@yandex.ru> <83y1qt11xq.fsf@gnu.org> <9eb013da-d0fc-8e17-c6e3-1e8f913aebfa@yandex.ru> <83pmc50xxc.fsf@gnu.org> <71cfe4e8-3bb8-b0a6-9be5-8c0a6d92cfab@yandex.ru> <83h6xg29z3.fsf@gnu.org> <838ris22n4.fsf@gnu.org> <8335901zz3.fsf@gnu.org> <87cz84y5le.fsf@posteo.net> <3F91FDEA-881A-49DB-BB52-5A0D81C004CE@gmail.com> <87k02aihrz.fsf@posteo.net> <86ce08480352653995b8@heytings.org> <86sfgxq3ph.fsf@stephe-leake.org> Mime-Version: 1.0 Content-Type: multipart/alternative; boundary="000000000000e14b1705f0ff22c1" Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214"; logging-data="28697"; mail-complaints-to="usenet@ciao.gmane.io" Cc: Gregory Heytings , Philip Kaludercic , Yuan Fu , Stefan Monnier , Eli Zaretskii , Dmitry Gutov , Tim Cross , emacs-devel To: Stephen Leake Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Thu Dec 29 23:38:47 2022 Return-path: Envelope-to: ged-emacs-devel@m.gmane-mx.org Original-Received: from lists.gnu.org ([209.51.188.17]) by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1pB1Xz-0007IG-Kh for ged-emacs-devel@m.gmane-mx.org; Thu, 29 Dec 2022 23:38:47 +0100 Original-Received: from localhost ([::1] helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1pB1XF-0002YB-0I; Thu, 29 Dec 2022 17:38:01 -0500 Original-Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1pB1XD-0002Y2-Qq for emacs-devel@gnu.org; Thu, 29 Dec 2022 17:37:59 -0500 Original-Received: from mail-pg1-x52b.google.com ([2607:f8b0:4864:20::52b]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1pB1XC-00038X-08; Thu, 29 Dec 2022 17:37:59 -0500 Original-Received: by mail-pg1-x52b.google.com with SMTP id 79so13168164pgf.11; Thu, 29 Dec 2022 14:37:57 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=RY0GSEIaqoznrb8sjczXY0zwLy3mMUjveBjJZty0uSI=; b=RUYWR7iGkG22iFiOIu+c1EGcROQl6fUy1ldNdlYEOUKYt+49Zy5x+Z8mG5qzSiCP27 76nQfRSQJMiuQloAYX6zcGvTVPxgG5HUCroogG2i1/Wsp8CULOvAnEVaqDOgdHoZHbfQ sIkeyCprc3nuZ5/N8Og4+hxR/GJbDxhvP1VHKcjlyyWJ2gzXMea7ML01E9qN4wpdhrNx nFuR5Ma7YbwwDeE2wRG/nHKzS+8lgymvTd77Fsd8LxLy/+ovOlsaXei5Qtt4lnkeu7xd XG+vpR7ylUnbkhFek/tOjWJOsfYN6lPrxHf1H1jEeZKYo5V2S4T3k1RYr4tVQkwl7acx +SEA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=RY0GSEIaqoznrb8sjczXY0zwLy3mMUjveBjJZty0uSI=; b=Y7kA6lv/YYL1ceaDltpIJ7seCbxQ+f2/u9n+BryyvcqRhY/IqKPI0tPiZiCey7+eX5 AMO4uRe0bIGZQ63HJFqGnNatg8D7SOGnvq/Wov8zrMJpBNnDzQ9lf8GyhWhK5h5bPRdo 5CO241MRcoyRXDBBgE0noPYH62ibnMTU4eptJa1VttOJpPbOOKj3CDY8NXH+PziecFj/ G/IPriLc9gOJAytEnJSdn7CqKCuB4O6WCCvZYCctXOTwt5/gF+v7Gt48dB8Ys5bV7Lvv zX8ooaa7qc96dLb0vvsroxGXteXDHnuoTXttCVauwpk0A6TtqPw3KVEejyN1wmtjlZHZ AEuw== X-Gm-Message-State: AFqh2kqpUhbLBUnkGAvmdnG18rpoy+lcyGHIaoZE1zbhGYdKjmVHN5K5 80BrPxJwMhZXx2qiiXFhszTgOkDRU2aWW5dCbls= X-Google-Smtp-Source: AMrXdXsvLo+rlOGSGOg35v7boL8StdMfycWM0hLmIuOQIlRnpBvuIIeOMSxc1jTXA90qou2jmF+J96e3P+ghyoilVKY= X-Received: by 2002:a65:490c:0:b0:470:5b0d:b50e with SMTP id p12-20020a65490c000000b004705b0db50emr1987181pgs.488.1672353476068; Thu, 29 Dec 2022 14:37:56 -0800 (PST) In-Reply-To: <86sfgxq3ph.fsf@stephe-leake.org> Received-SPF: pass client-ip=2607:f8b0:4864:20::52b; envelope-from=owinebar@gmail.com; helo=mail-pg1-x52b.google.com X-Spam_score_int: -20 X-Spam_score: -2.1 X-Spam_bar: -- X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Xref: news.gmane.io gmane.emacs.devel:302071 Archived-At: --000000000000e14b1705f0ff22c1 Content-Type: text/plain; charset="UTF-8" On Thu, Dec 29, 2022, 4:50 PM Stephen Leake wrote: > Lynn Winebarger writes: > > > On Thu, Dec 29, 2022 at 10:28 AM Gregory Heytings > wrote: > >> > Do you know how strong the dependency on node is? As I said before, > it > >> > seems that it is possible to evaluate the grammar files that use the > DSL > >> > using something like quickjs as well > >> > > >> > >> That's not possible, no, at least not without a lot of complications > that > >> do not seem worth the price, compared to installing Node.js. And note > >> that even if that were feasible, it would only solve the first half of > the > >> problem: to transform a grammar.js file into its corresponding parser.c > >> file, you also need the tree-sitter command line program. > > > > Maybe a better question is - is it possible to adapt the semantic > > parser generators (or others in emacs) to create the ".c" files for > > use with libtreesitter? > > This is possible in principle; I've thought about doing it with the > wisitoken parser generator. > > However, the format/struture/details of the output is not documented, > and may change in future tree-sitter releases. > True, *but* each parser has as an "ABI version" constant encoded into it, and the source states the library is supposed to be backwards compatible (currently 14, the the grammar.c files I saw are at 13). So if we get something that complies with version 13, it should be fine. I've looked over a couple of the parser.c files, and they appear to be pretty standard implementations of.LR stack automata. The CLI tool supports ambiguous grammars, but if you start from one an existing LR-type of parser generator (bison or wisent, say) can handle, then that piece should be relatively straightforward. The novel feature appears to be the automatic derivation of a "flattened" AST data structure, which the library uses to construct ASTs for a given action. I did come up with my own calculation of a flattened AST structure a year or two ago, though I wasn't thrilled with some of the results I got - automatically picking a symbol to use to "break" recursive inclusion by reverting back to a pointer was tricky. I believe I went with a heuristic of converting the symbol that had the most incoming edges to a reference in each place it appeared as a field, then repeated until there were no data structures that were required to physically contain themselves. I used that rule because it seemed like it would minimize the number of symbols not allowed to appear as fields, thus maximizing the flattening effect (too well, in fact - I had to introduce some constraints to keep some of the nonlinear structure). So it might be interesting to go through tree-sitters' algorithm to see what they came up with. Lynn --000000000000e14b1705f0ff22c1 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
On Thu, Dec 29, 2022, 4:50 PM Stephen Leake <stephen_leake@stephe-leake.org> = wrote:
Lynn Winebarger <owineba= r@gmail.com> writes:

> On Thu, Dec 29, 2022 at 10:28 AM Gregory Heytings <gregory@heytin= gs.org> wrote:
>> > Do you know how strong the dependency on node is?=C2=A0 As I = said before, it
>> > seems that it is possible to evaluate the grammar files that = use the DSL
>> > using something like quickjs as well
>> >
>>
>> That's not possible, no, at least not without a lot of complic= ations that
>> do not seem worth the price, compared to installing Node.js.=C2=A0= And note
>> that even if that were feasible, it would only solve the first hal= f of the
>> problem: to transform a grammar.js file into its corresponding par= ser.c
>> file, you also need the tree-sitter command line program.
>
> Maybe a better question is - is it possible to adapt the semantic
> parser generators (or others in emacs) to create the ".c" fi= les for
> use with libtreesitter?

This is possible in principle; I've thought about doing it with the
wisitoken parser generator.

However, the format/struture/details of the output is not documented,
and may change in future tree-sitter releases.
=


T= rue, *but* each parser has as an "ABI version" constant encoded i= nto it, and the source states the library is supposed to be backwards compa= tible (currently 14, the the grammar.c files I saw are at 13).=C2=A0 So if = we get something that complies with version 13, it should be fine.

I've looked over a couple of= the parser.c files, and they appear to be pretty standard implementations = of.LR stack automata. The CLI tool supports ambiguous grammars, but if you = start from one an existing LR-type of parser generator (bison or wisent, sa= y) can handle, then that piece should be relatively straightforward.
<= div dir=3D"auto">The novel feature appears to be the automatic derivation o= f a "flattened" AST data structure, which the library uses to con= struct ASTs for a given action.

I did come up with my own calculation of a flattened AST structur= e a year or two ago, though I wasn't thrilled with some of the results = I got - automatically picking a symbol to use to "break" recursiv= e inclusion by reverting back to a pointer was tricky.=C2=A0 I believe I we= nt with a heuristic of converting the symbol that had the most incoming edg= es to a reference in each place it appeared as a field, then repeated until= there were no data structures that were required to physically contain the= mselves.=C2=A0 I used that rule because it seemed like it would minimize th= e number of symbols not allowed to appear as fields, thus maximizing the fl= attening effect (too well, in fact - I had to introduce some constraints to= keep some of the nonlinear structure).

So it might be interesting to go through tree-sitters' = algorithm to see what they came up with.

<= div dir=3D"auto">Lynn

--000000000000e14b1705f0ff22c1--