From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail From: Daniel Colascione Newsgroups: gmane.emacs.devel Subject: Re: Tree-sitter maturity Date: Sat, 04 Jan 2025 16:34:39 -0500 Message-ID: <83D59D1D-852A-4687-838F-9AC0849AE50F@dancol.org> References: <86ldwdm7xg.fsf@gnu.org> <6765355b.c80a0220.1a6b24.3117SMTPIN_ADDED_BROKEN@mx.google.com> <00554790-CACA-4233-8846-9E091CF1F7AA@gmail.com> <86msgl2red.fsf@gnu.org> <87o710sr7y.fsf@debian-hx90.lan> <8734i9tmze.fsf@posteo.net> <86plldwb7w.fsf@gnu.org> <87ttapryxr.fsf@posteo.net> <0883EB00-3BB2-4BC8-95D1-45F4497C0526@dancol.org> <87msge8bv8.fsf@dancol.org> <6775a459.170a0220.2f3d1e.1897SMTPIN_ADDED_BROKEN@mx.google.com> <87h66emqan.fsf@dancol.org> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214"; logging-data="32059"; mail-complaints-to="usenet@ciao.gmane.io" User-Agent: K-9 Mail for Android Cc: =?ISO-8859-1?Q?Bj=F6rn_Bidar?= , Philip Kaludercic , emacs-devel , Eli Zaretskii , Richard Stallman , manphiz@gmail.com To: Lynn Winebarger Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Sat Jan 04 22:35:38 2025 Return-path: Envelope-to: ged-emacs-devel@m.gmane-mx.org Original-Received: from lists.gnu.org ([209.51.188.17]) by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1tUBo1-0008C5-R7 for ged-emacs-devel@m.gmane-mx.org; Sat, 04 Jan 2025 22:35:38 +0100 Original-Received: from localhost ([::1] helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1tUBnO-0005hG-AI; Sat, 04 Jan 2025 16:34:58 -0500 Original-Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1tUBnK-0005gu-Id for emacs-devel@gnu.org; Sat, 04 Jan 2025 16:34:55 -0500 Original-Received: from dancol.org ([2600:3c01:e000:3d8::1]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1tUBnI-00075G-7H; Sat, 04 Jan 2025 16:34:54 -0500 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=dancol.org; s=x; h=Content-Transfer-Encoding:Content-Type:MIME-Version:Message-ID: References:In-Reply-To:Subject:CC:To:From:Date:Sender:Reply-To:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc :Resent-Message-ID:List-Id:List-Help:List-Unsubscribe:List-Subscribe: List-Post:List-Owner:List-Archive; bh=RUVxtoBrsIWwdomctqIFuWb7cJvYfz+os0G+2aDwZQM=; b=HXtU0oyH3ISKHTC7EpCNiWYifd REwVsjPCFbArbxVd412IhAOAnMbAedfuUP8sNMUQddhFWwgPBWDeWZlpBnYqkiC953OsGfOdIy23Z mY6xqqUMaEA7PBV5QKBjNxhZlGcjNBSFX8V8YoVD7YjZgVM/O3X45Cdu3AnNNZQSZQcHwPda2sN35 leFX7xLdap03MAeqLcv2yYE0IGYHWY0bC7CUeABT8FNCzvUb/Vku+vXXhnO8PkRXELLF4J7OJdoKr CuztfqYf/3JqulJr+JDC0BnmQHbsYdCeLkBI9KtwhPN9NtGSdT2W5GkPAarPxZXHMFlvXoyDSV+9v XiXsvPrA==; Original-Received: from 2603-9001-4203-1ab2-755d-99b8-dc0b-6bff.inf6.spectrum.com ([2603:9001:4203:1ab2:755d:99b8:dc0b:6bff]:42668 helo=[IPv6:::1]) by dancol.org with esmtpsa (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1tUBn9-00071j-1b; Sat, 04 Jan 2025 16:34:43 -0500 In-Reply-To: Received-SPF: pass client-ip=2600:3c01:e000:3d8::1; envelope-from=dancol@dancol.org; helo=dancol.org X-Spam_score_int: -20 X-Spam_score: -2.1 X-Spam_bar: -- X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Xref: news.gmane.io gmane.emacs.devel:327688 Archived-At: On January 4, 2025 4:25:51 PM EST, Lynn Winebarger = wrote: >On Sat, Jan 4, 2025 at 12:39=E2=80=AFPM Daniel Colascione wrote: >> Lynn Winebarger writes: >> >> > On Wed, Jan 1, 2025 at 3:23=E2=80=AFPM Bj=C3=B6rn Bidar wrote: >> >> Lynn Winebarger writes: >> >> >> Tree sitter, as wonderful as it is, strikes me as a bit of a Rube >> >> >> Goldberg machine architecturally: JS *and* Rust *and* C? Really? = :-) >> >> > >> >> > They evidently decided to use JSON and a simple schema to specify = the >> >> > concrete grammar, instead of creating a DSL for the purpose=2E >> >> > Javascript is just a convenient way for embedding code into JSON t= he >> >> > same way LISP programmers use lisp to generate S-expressions=2E O= nce >> >> > you have the JSON format generated, javascript is not used=2E >> >> > >> >> > The rest of the project is really composed of orthogonal component= s, >> >> > the GLR grammar compiler (written in Rust) and the run-time GLR >> >> > parsing engine, written in C=2E The grammar compiler produces the >> >> > parsing tables in the form of C source code that is compiled toget= her >> >> > with the library for a single library per grammar, but the C libra= ry >> >> > does not actually require the parsing tables to be statically know= n at >> >> > compile-time, at least the last I looked, unless some really obscu= re >> >> > dependence=2E The procedural interface to the parser just takes a >> >> > pointer to the parser table data structure at run-time=2E >> >> > >> >> > Since GLR grammars are basically arbitrary (ambiguous) LR(1) gramm= ars, >> >> > the parser run-time has to implement a fairly sophisticated algori= thm >> >> > (graph-stacks) to be efficient=2E Having implemented the LALR par= ser >> >> > generator at least 3 times in the last couple of decades (just for= my >> >> > own use), generating the parse tables looks like a lot simpler (an= d >> >> > well-understood) problem to solve than the GLR run-time=2E More >> >> > importantly, the efficiency of the grammar compiler is not all tha= t >> >> > critical compared to the run-time=2E >> >> > >> >> >> >> Additional alernatives instead of Node are already a good alternativ= e=2E >> >> Using WASM as the output format also does not sound bad assuming the= ir >> >> is some abstraction from the tree-sitter library side=2E >> > >> > I'm not sure why WASM would be interesting=2E AFAICT, it's just anot= her >> > set of bindings to the C library, maybe with the tables compiled into >> > WASM binary module (or whatever the correct term should be - I'm not = a >> > WASM expert)=2E In any case, AFAIK Emacs has no particular capabilit= y >> > for using WASM files as dynamic libraries in general=2E Maybe if Ema= cs >> > itself was compiled to WASM, in which case I suppose the function for >> > dynamically loading libraries would implicitly load such modules=2E >> > >> > OTOH, the generated WASM bindings might provide an example of using >> > the tree-sitter DLL with the in-memory parse table structure not >> > embedded in the tree-sitter DLL=2E Is that what you meant? >> >> I think people get too excited about WASM=2E It's just a 1) portable, = 2) >> sandboxed mechanism for running the same programs you could compile to >> native code=2E What's in it for us? We don't need a security sandbox = for >> parsers=2E If we want to sandbox, we should do it at a higher level=2E >> The portability aspect seems like only a minor benefit: sure, it's less >> of a logistical headache to ship one prebuilt binary than to ship N for >> N different architectures, but either way, you're opting into the >> headache of prebuilt binaries=2E I'd rather dynamically build from >> source, TBH=2E >> >> >> > I agree, a generic grammar capturing the structures of most >> >> > programming languages would be useful=2E It is definitely possibl= e to >> >> > extract the syntactic/semantic concepts from C++ and Python to cre= ate >> >> > such a grammar, if you are willing to allow nested grammars >> >> > appropriately delimited=2E For example, a constructor context wou= ld >> >> > delimit an expression in a data language that is embedded in a >> >> > constructor context that may itself have delimited value contexts >> >> > where the functional/procedural grammar may appear, ad infinitum= =2E The >> >> > procedural and data grammars are distinct but mutually recursive= =2E >> >> > That would be if the form appeared in an rvalue-context=2E For l-= value >> >> > expressions, the same constructor delimiting syntax can become a >> >> > binding form, at least, with subexpressions of binding forms also >> >> > being binding forms=2E As long as the scanner is dynamically set >> >> > according to the grammar context (and recognizes/signals the closi= ng >> >> > delimiter), the grammar can be made non-ambiguous because a given >> >> > character will produce context-appropriate terminal symbols=2E >> >> >> >> What kind of scanner are you referring to? Something that works like= a >> >> binding generator but for AST? >> > >> > A few years ago, I wanted a template system for this terrible >> > proprietary language I was working with, so I wrote this grammar that >> > could encompass that language (which, AFAICT, was only defined by >> > company programmers hacking additional patterns directly into their >> > hand-written parser, for which I reverse-engineered a LALR(1) >> > grammar), a shell-type interpolation sublanguage, and other languages >> > that stuck to the syntactic constructs allowed by Python and C++=2E = It >> > was a bear to work out, and I ended up throwing it away, anyway=2E B= ut >> > the point is, at the start of an interpolation context, the parser >> > would switch scanner and parser tables to the language assigned to th= e >> > scope of that interpolation context (associated with a particular >> > terminal introducing that context in the "current" parser table)=2E = So >> > while parsing language A, "${" might introduce an interpolation >> > context for language B, "$!{" for language C, "$[" for language D, >> > etc=2E As long as the new scanner or parser could discriminate the >> > closing terminal as ending the sublanguage program and returning to >> > language A context, it should work=2E >> > >> > Anyway, for that purpose, I wanted a grammar that would be flexible >> > enough that I could just switch the bindings for the actions and >> > mapping of terminals, not change the whole grammar, so I would only >> > need to do the grammar analysis once=2E That being said, I never >> > actually showed it could be done with multiple real terminals for a >> > single meta-terminal=2E That is, in the previous paragraph there mig= ht >> > have been a "meta-terminal" "START_INTERPOLATION_CONTEXT" that would >> > expand to 3 concrete terminals (in the grammar for language A) >> > "START_INTERPOLATION_B", "START_INTERPOLATION_C", >> > "START_INTERPOLATION_D", so the parser would have to know which of >> > those concrete terminals was being reduced to choose the right action= =2E >> > I've been waiting for the details to rot from my memory so I can star= t >> > from scratch on a concrete grammar=2E >> >> ANTLR's lexer modes gives you a similarly powerful capability, FWIW=2E >> >> > Aside from being useful for generic templating purposes, Such a >> > generic grammar would be of use for the purpose Daniel described, i= =2Ee=2E >> > a layer of abstraction usable for almost any modern language, even in >> > polyglot texts=2E >> >> Arbitrary language composition has been the holy grail for a while, yes= ? > >I'm not sure what you mean, but I was just answering Bj=C3=B6rn's questio= n >about the context=2E > >> GLR grammars are closed under composition too=2E Making it easier to >> define tree-sitter grammars and lexers that refer to each other would b= e >> nice=2E At this point, though, I think it's more important to finish t= he >> task of making tree-sitter-based modes as usable and Emacs-y as >> traditional ones than to imagine new meta-parser >> description abstractions=2E > >I'm probably not being explicit enough=2E I only brought this up in >response to your comment a few messages ago: > >> > > > > Some Emacs modes could ship with =2Ejs grammars sourced from up= stream >> > > > > editor-neutral projects=2E Other modes might just build tree s= itter parse >> > > > > tables in elisp using something vaguely like SMIE syntax=2E Bo= th styles >> > > > > of mode would be customizable by end users, and we'd (because, = I'm a >> > > > > broken record, vendor vendor vendor) we'd maintain compatibilit= y without >> > > > > mysterious AST-change-related breakages=2E > >The relevant part of what I wrote above is identifying a grammar of >symbols for syntactic-semantic concepts that can stand in relation to >the concrete AST nodes produced by tree-sitter (or whatever) the same >way the symbols (categories) of syntax-tables relate to characters=2E >So elisp authors could parameterize their code over those abstract >syntactic categories rather than the concrete AST nodes from >tree-sitter=2E Isn't that what you were talking about here? I was talking about tightly coupling Emacs modes and their grammars and pr= oviding an easy to use way of automatically building and loading customized= grammars=2E You're going one step further, and it's a good idea=2E IIUC, y= ou're suggesting modes provide a universal, abstract, and stable DOM distil= led from the unstable concrete syntax tree produced by tree sitter or anyth= ing else=2E (The use of tree sitter would be an implementation detail of th= e mode=2E) This way, Python and Rust and whatever modes would all provide a= n implementation of the same DOM API, and nodes in this DOM would be tagged= with universal concepts like "this is a defun" or "this is a comment"=2E T= hen you'd be able to write mode agnostic code to work with this DOM=2E Some of this exists today in the sense that modes each implement one off i= nterfaces like mark-defun, text properties describing special syntax reason= s, and so on=2E The DOM approach is more general=2E I don't think we're goi= ng to get it before tree sitter becomes ubiquitous=2E