From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail From: Lynn Winebarger Newsgroups: gmane.emacs.devel Subject: Re: Tree-sitter maturity Date: Sat, 4 Jan 2025 16:25:51 -0500 Message-ID: References: <67428b3d.c80a0220.2f3036.adbdSMTPIN_ADDED_BROKEN@mx.google.com> <86ldwdm7xg.fsf@gnu.org> <6765355b.c80a0220.1a6b24.3117SMTPIN_ADDED_BROKEN@mx.google.com> <00554790-CACA-4233-8846-9E091CF1F7AA@gmail.com> <86msgl2red.fsf@gnu.org> <87o710sr7y.fsf@debian-hx90.lan> <8734i9tmze.fsf@posteo.net> <86plldwb7w.fsf@gnu.org> <87ttapryxr.fsf@posteo.net> <0883EB00-3BB2-4BC8-95D1-45F4497C0526@dancol.org> <87msge8bv8.fsf@dancol.org> <6775a459.170a0220.2f3d1e.1897SMTPIN_ADDED_BROKEN@mx.google.com> <87h66emqan.fsf@dancol.org> Mime-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214"; logging-data="740"; mail-complaints-to="usenet@ciao.gmane.io" Cc: =?UTF-8?B?QmrDtnJuIEJpZGFy?= , Philip Kaludercic , emacs-devel , Eli Zaretskii , Richard Stallman , manphiz@gmail.com To: Daniel Colascione Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Sat Jan 04 22:26:51 2025 Return-path: Envelope-to: ged-emacs-devel@m.gmane-mx.org Original-Received: from lists.gnu.org ([209.51.188.17]) by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1tUBfX-000Adn-Az for ged-emacs-devel@m.gmane-mx.org; Sat, 04 Jan 2025 22:26:51 +0100 Original-Received: from localhost ([::1] helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1tUBev-0004Xn-W0; Sat, 04 Jan 2025 16:26:14 -0500 Original-Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1tUBeq-0004XC-9K for emacs-devel@gnu.org; Sat, 04 Jan 2025 16:26:09 -0500 Original-Received: from mail-wm1-x332.google.com ([2a00:1450:4864:20::332]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1tUBen-0005BG-GR; Sat, 04 Jan 2025 16:26:07 -0500 Original-Received: by mail-wm1-x332.google.com with SMTP id 5b1f17b1804b1-436246b1f9bso21535365e9.1; Sat, 04 Jan 2025 13:26:04 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1736025963; x=1736630763; darn=gnu.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=zwIra2gKjN83hxuiI9WOZHsNQMUKGXKBgsR1GZOlbss=; b=PiBzBof1HBzmt6vPFPsiFj8BzvSDxGoOaivKK/Ii0YOTI5exAMaPL47vON1zJbM8hI 11r0chLGiqKtImCtmynY7VnOMhT+W8kA7+aJSUpjGgnj397Oab5Uag37wF6+C7xDolQx w5uK0R9UVXrzvrvondCElaeDItREaGeSKOSRElnKiWhPHyfh/Ha91QmK4daPvBMddzxO oD4il80VxZQDi42V6XitAPUbFtpqx0bu9E31YAZLWogNsvqMpwSnBgvmw1yaW+/sfzId 4mCpbdrueZdyQtDK/RT6Pvg6xLedyrHr/PCNISNh3FymCR4UqpIoB1XKsDzvN2u+0ejf x0Yg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1736025963; x=1736630763; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=zwIra2gKjN83hxuiI9WOZHsNQMUKGXKBgsR1GZOlbss=; b=Z/p44aB+sfnOcPyNesBcF3PaciJ5+lPF+8ZvjkFJAwTVnouRFIOw4SRLx50ecdvidA dGVexjWbymX1PCk84A/Kx0BONyuHU8J+b+iqtFyUrMSsRQLAqS9ZLCmjcu+xgoIiqqlO Za0kY8ePIAsT9Ie/Ism/f3UXYU7YVko+2V2nhaO35N0mfcEcw6zNfM31FdQe0wEYs/tA J3FvwYdLRQaZchlFFT0Ieh+48Ty1WQa7MhblSYYiqhI+hPXgot7LtSrHiQv0UAFS6IGh /Macya+vDTL55ugMOQtHL4sKBLcmmgqRo6FDkwDvzZ0pX958sf/Y1H1Nqd9VwOhZm1PV WjQQ== X-Forwarded-Encrypted: i=1; AJvYcCVe7DNjuXas/3w3KTDtEZbozfMGAOZLBNL0yy7UFnQr1nPrE/aQiAAwmUaheJKoMpjPpMDjr/IOADcACtY=@gnu.org, AJvYcCWEtbuxQmnP9ofa/O/8uoMPcBJrJIDm7XvmMaKDRCLqiyOqHJyhW5YYwDHqWIyjJtQyqxls@gnu.org, AJvYcCXvuOU9Ln4x2IpJRGvceJxo6+/sluEdL2uRHwyUThjtZ09kdRBX5SLLn5293eS1WTPsj2O5@gnu.org X-Gm-Message-State: AOJu0Yyu4EmoysR8TKbRyGZIU9x5jNTklb/M3p94g2K269camfthQ6hG HDyPo9/fBSUjJKhMlDCMk4GpTPcUJ/wWmy8p4xu1UhY5abGrmCiWAW36BXj0vTShtvJI7VHW243 AwBnQsffy54AqIs0c0poSRBSOpWI= X-Gm-Gg: ASbGncseMJ+lHmx4gSO1vN6NBpe7rxQt05k/ZaA8OtBvYVXUiGgVxZCF1cwmHRsg7oD UW2Wy1v9f2qO3Sk6uYjslXg0vr8b55kPJdf8sst6lSBlep8fIP8+vKw7mSdleGTFHXdBn X-Google-Smtp-Source: AGHT+IEGSVPuv70OdBG1gwU40z5D7OCqtHsss/+WhWAXCEgYKIDOwRvq3VEeL7KUw3rVBh+BBs8VOFPBgzIXnbiZHUE= X-Received: by 2002:adf:b601:0:b0:385:f7b2:aad8 with SMTP id ffacd0b85a97d-38a221e1e09mr13619281f8f.1.1736025962451; Sat, 04 Jan 2025 13:26:02 -0800 (PST) In-Reply-To: <87h66emqan.fsf@dancol.org> Received-SPF: pass client-ip=2a00:1450:4864:20::332; envelope-from=owinebar@gmail.com; helo=mail-wm1-x332.google.com X-Spam_score_int: -20 X-Spam_score: -2.1 X-Spam_bar: -- X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001, RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Xref: news.gmane.io gmane.emacs.devel:327687 Archived-At: On Sat, Jan 4, 2025 at 12:39=E2=80=AFPM Daniel Colascione wrote: > Lynn Winebarger writes: > > > On Wed, Jan 1, 2025 at 3:23=E2=80=AFPM Bj=C3=B6rn Bidar wrote: > >> Lynn Winebarger writes: > >> >> Tree sitter, as wonderful as it is, strikes me as a bit of a Rube > >> >> Goldberg machine architecturally: JS *and* Rust *and* C? Really? :-= ) > >> > > >> > They evidently decided to use JSON and a simple schema to specify th= e > >> > concrete grammar, instead of creating a DSL for the purpose. > >> > Javascript is just a convenient way for embedding code into JSON the > >> > same way LISP programmers use lisp to generate S-expressions. Once > >> > you have the JSON format generated, javascript is not used. > >> > > >> > The rest of the project is really composed of orthogonal components, > >> > the GLR grammar compiler (written in Rust) and the run-time GLR > >> > parsing engine, written in C. The grammar compiler produces the > >> > parsing tables in the form of C source code that is compiled togethe= r > >> > with the library for a single library per grammar, but the C library > >> > does not actually require the parsing tables to be statically known = at > >> > compile-time, at least the last I looked, unless some really obscure > >> > dependence. The procedural interface to the parser just takes a > >> > pointer to the parser table data structure at run-time. > >> > > >> > Since GLR grammars are basically arbitrary (ambiguous) LR(1) grammar= s, > >> > the parser run-time has to implement a fairly sophisticated algorith= m > >> > (graph-stacks) to be efficient. Having implemented the LALR parser > >> > generator at least 3 times in the last couple of decades (just for m= y > >> > own use), generating the parse tables looks like a lot simpler (and > >> > well-understood) problem to solve than the GLR run-time. More > >> > importantly, the efficiency of the grammar compiler is not all that > >> > critical compared to the run-time. > >> > > >> > >> Additional alernatives instead of Node are already a good alternative. > >> Using WASM as the output format also does not sound bad assuming their > >> is some abstraction from the tree-sitter library side. > > > > I'm not sure why WASM would be interesting. AFAICT, it's just another > > set of bindings to the C library, maybe with the tables compiled into > > WASM binary module (or whatever the correct term should be - I'm not a > > WASM expert). In any case, AFAIK Emacs has no particular capability > > for using WASM files as dynamic libraries in general. Maybe if Emacs > > itself was compiled to WASM, in which case I suppose the function for > > dynamically loading libraries would implicitly load such modules. > > > > OTOH, the generated WASM bindings might provide an example of using > > the tree-sitter DLL with the in-memory parse table structure not > > embedded in the tree-sitter DLL. Is that what you meant? > > I think people get too excited about WASM. It's just a 1) portable, 2) > sandboxed mechanism for running the same programs you could compile to > native code. What's in it for us? We don't need a security sandbox for > parsers. If we want to sandbox, we should do it at a higher level. > The portability aspect seems like only a minor benefit: sure, it's less > of a logistical headache to ship one prebuilt binary than to ship N for > N different architectures, but either way, you're opting into the > headache of prebuilt binaries. I'd rather dynamically build from > source, TBH. > > >> > I agree, a generic grammar capturing the structures of most > >> > programming languages would be useful. It is definitely possible to > >> > extract the syntactic/semantic concepts from C++ and Python to creat= e > >> > such a grammar, if you are willing to allow nested grammars > >> > appropriately delimited. For example, a constructor context would > >> > delimit an expression in a data language that is embedded in a > >> > constructor context that may itself have delimited value contexts > >> > where the functional/procedural grammar may appear, ad infinitum. T= he > >> > procedural and data grammars are distinct but mutually recursive. > >> > That would be if the form appeared in an rvalue-context. For l-valu= e > >> > expressions, the same constructor delimiting syntax can become a > >> > binding form, at least, with subexpressions of binding forms also > >> > being binding forms. As long as the scanner is dynamically set > >> > according to the grammar context (and recognizes/signals the closing > >> > delimiter), the grammar can be made non-ambiguous because a given > >> > character will produce context-appropriate terminal symbols. > >> > >> What kind of scanner are you referring to? Something that works like a > >> binding generator but for AST? > > > > A few years ago, I wanted a template system for this terrible > > proprietary language I was working with, so I wrote this grammar that > > could encompass that language (which, AFAICT, was only defined by > > company programmers hacking additional patterns directly into their > > hand-written parser, for which I reverse-engineered a LALR(1) > > grammar), a shell-type interpolation sublanguage, and other languages > > that stuck to the syntactic constructs allowed by Python and C++. It > > was a bear to work out, and I ended up throwing it away, anyway. But > > the point is, at the start of an interpolation context, the parser > > would switch scanner and parser tables to the language assigned to the > > scope of that interpolation context (associated with a particular > > terminal introducing that context in the "current" parser table). So > > while parsing language A, "${" might introduce an interpolation > > context for language B, "$!{" for language C, "$[" for language D, > > etc. As long as the new scanner or parser could discriminate the > > closing terminal as ending the sublanguage program and returning to > > language A context, it should work. > > > > Anyway, for that purpose, I wanted a grammar that would be flexible > > enough that I could just switch the bindings for the actions and > > mapping of terminals, not change the whole grammar, so I would only > > need to do the grammar analysis once. That being said, I never > > actually showed it could be done with multiple real terminals for a > > single meta-terminal. That is, in the previous paragraph there might > > have been a "meta-terminal" "START_INTERPOLATION_CONTEXT" that would > > expand to 3 concrete terminals (in the grammar for language A) > > "START_INTERPOLATION_B", "START_INTERPOLATION_C", > > "START_INTERPOLATION_D", so the parser would have to know which of > > those concrete terminals was being reduced to choose the right action. > > I've been waiting for the details to rot from my memory so I can start > > from scratch on a concrete grammar. > > ANTLR's lexer modes gives you a similarly powerful capability, FWIW. > > > Aside from being useful for generic templating purposes, Such a > > generic grammar would be of use for the purpose Daniel described, i.e. > > a layer of abstraction usable for almost any modern language, even in > > polyglot texts. > > Arbitrary language composition has been the holy grail for a while, yes? I'm not sure what you mean, but I was just answering Bj=C3=B6rn's question about the context. > GLR grammars are closed under composition too. Making it easier to > define tree-sitter grammars and lexers that refer to each other would be > nice. At this point, though, I think it's more important to finish the > task of making tree-sitter-based modes as usable and Emacs-y as > traditional ones than to imagine new meta-parser > description abstractions. I'm probably not being explicit enough. I only brought this up in response to your comment a few messages ago: > > > > > Some Emacs modes could ship with .js grammars sourced from upstre= am > > > > > editor-neutral projects. Other modes might just build tree sitte= r parse > > > > > tables in elisp using something vaguely like SMIE syntax. Both s= tyles > > > > > of mode would be customizable by end users, and we'd (because, I'= m a > > > > > broken record, vendor vendor vendor) we'd maintain compatibility = without > > > > > mysterious AST-change-related breakages. The relevant part of what I wrote above is identifying a grammar of symbols for syntactic-semantic concepts that can stand in relation to the concrete AST nodes produced by tree-sitter (or whatever) the same way the symbols (categories) of syntax-tables relate to characters. So elisp authors could parameterize their code over those abstract syntactic categories rather than the concrete AST nodes from tree-sitter. Isn't that what you were talking about here? The grammar I wrote was quite large, and could capture things like a constructor expression being in the middle of two assignments so it was both a destructuring bind (lvalue) and an rvalue. Fortunately I really can't recall the details of the one I derived before, but overall the categories would encompass: - C++ syntactic constructs [ covers a lot ] - generators and data comprehensions (Python and its functional forebears for these) - interpolation (shell and query languages) That covers most of the ground I'm familiar with. It won't cover unrestricted TeX, but that language's syntax is dynamic, so what can be done? Unlike syntax-table categories, I think each concrete language would have to index the generic symbols to cover the specific types in that language, i..e. a QUOTED_LITERAL might have entries for raw strings and strings with an escape syntax, as well as other types that might be available. > > The point I keep trying to make is that you can't safely update a > foo-ts-mode tree sitter grammar without updating the corresponding > foo-ts-mode Lisp. They're tightly coupled. They're not separate > programs. Same goes for nvim or whatever using TS grammars. > Even distribution packagers understand the futility of consolidating > dependencies with unstable interfaces. Currently that is the case. Good luck negotiating that. I already stated my preference, but it requires developing a binary data descriptor type for dealing with arbitrary in-memory data structures. Maybe now that pure space is close to removal I'll take a stab at it. It's the kind of thing that should really be used in implementing extensible redumping for pdumper. And redumping was never going to get accepted with pure-space involved, AFAICT. Lynn