From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail From: Lynn Winebarger Newsgroups: gmane.emacs.devel Subject: Re: Tree-sitter maturity Date: Tue, 31 Dec 2024 17:29:04 -0500 Message-ID: References: <1ed88fca-788a-fe9f-b6c8-edb2f49751c9@mavit.org.uk> <67428b3d.c80a0220.2f3036.adbdSMTPIN_ADDED_BROKEN@mx.google.com> <86ldwdm7xg.fsf@gnu.org> <6765355b.c80a0220.1a6b24.3117SMTPIN_ADDED_BROKEN@mx.google.com> <00554790-CACA-4233-8846-9E091CF1F7AA@gmail.com> <86msgl2red.fsf@gnu.org> <87o710sr7y.fsf@debian-hx90.lan> <8734i9tmze.fsf@posteo.net> <86plldwb7w.fsf@gnu.org> <87ttapryxr.fsf@posteo.net> <0883EB00-3BB2-4BC8-95D1-45F4497C0526@dancol.org> <87msge8bv8.fsf@dancol.org> Mime-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214"; logging-data="39330"; mail-complaints-to="usenet@ciao.gmane.io" Cc: Philip Kaludercic , emacs-devel , Eli Zaretskii , Richard Stallman , manphiz@gmail.com To: Daniel Colascione Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Tue Dec 31 23:30:27 2024 Return-path: Envelope-to: ged-emacs-devel@m.gmane-mx.org Original-Received: from lists.gnu.org ([209.51.188.17]) by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1tSkks-000A33-06 for ged-emacs-devel@m.gmane-mx.org; Tue, 31 Dec 2024 23:30:27 +0100 Original-Received: from localhost ([::1] helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1tSkjy-0002hn-Pv; Tue, 31 Dec 2024 17:29:30 -0500 Original-Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1tSkjw-0002hQ-Ki for emacs-devel@gnu.org; Tue, 31 Dec 2024 17:29:28 -0500 Original-Received: from mail-wm1-x330.google.com ([2a00:1450:4864:20::330]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1tSkjt-0006Vp-1v; Tue, 31 Dec 2024 17:29:27 -0500 Original-Received: by mail-wm1-x330.google.com with SMTP id 5b1f17b1804b1-434f398a171so10185995e9.2; Tue, 31 Dec 2024 14:29:19 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1735684157; x=1736288957; darn=gnu.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=7zvtRGOWYnscCyQkiGdxgOQdw6QVYfGkFVidrxZCzQU=; b=LsvTBWrtk61ZVN6hFZxbbT+tWZb4KXdLrtVJ81sjev/z3tuN1xMXs/JyvJavLtiDP/ ZOnTI2Gv4NA6uSFKS8SMoLgMJ9FJqbc15hyQWp/bXcXT+6abkXO2pS0TuppCpHBcmfRO k57asUqjjkI0Eod00GNKtQ7dHgX5gi6SgzqoEmxacAT9X3J7qC9HZaTkKkixX8bLw0Dh 3/Hq18DULcd+BVq6NoHGSD9goFFZdS753TsnAXRiCKK8G566WHhtU8I9tAg7w3Bnid1N Lx6VLZl2FYFJq9iHguhjw2l34npFmaIuHkZ0lxUw/KeBdUADjW++hdsaqFI9iPaTzdWn 4Ewg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1735684157; x=1736288957; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=7zvtRGOWYnscCyQkiGdxgOQdw6QVYfGkFVidrxZCzQU=; b=C8kqkQKZpGu/eikY9Z/ksEiR6YuvdjnDnWZY5cP7Wu41JYMCIXa4vKEVscEfjUmUHY ZkKcQccquwNh36Z80bxXgnxmgFcp03SoyCccxuW4ZFDefvBrxIjlt/F1la6ECkmK5qjF 8MVeLEDIwa8R8Pu/EKnrm3zO0iX8rLW3Cbiepgd+Pu9OgGWMhJAN3FRTQFz7aBFjoMv1 v+LlcuTkk4h6MORYqM3DIdBYHUXX71zcrmT1unsFcX67XS+iO6GGufvM4D4t560wSWRZ +GLxwy2hbY4TaZ/U6nHOEA33lAyJCQeWkQBwN61FZzGy7s5acHExYTByvNGHQdSBzARN 4XCQ== X-Forwarded-Encrypted: i=1; AJvYcCULCOwCkAj9CcDAkDuKGFk6D+goMdUcOnkdKTPWYG4XuPJyS1+40ARLKidFUiaJlhL9CmZ6@gnu.org, AJvYcCWBcHiGssS6M8veBDq5QCq/d8XMZS6oPlYY6lEbo86e+0mqfcp8PyvdOEZjhUaSFvL7Q3+aQ1NEjR0MWgU=@gnu.org, AJvYcCXN8jvwrasMMkCVhYp/83P0fWMRzDhrgLIMTkS8AxMOfL4TgckLgLAAManCOSovya4kQ++p@gnu.org X-Gm-Message-State: AOJu0YwLUjyQdFCLyo1C4ocqZW8x824ZM80SVaXQeDKQmbqrngiO1qQP EEKlCa/+FXC7+hCxGHQsTjtrmyNYZMXPQChLpitA+8Lj91LcBkSSdNYOyhHqjWQn6TjROIQVFJS EKlPaFpBx3NDSh8UaF5oqe3fwPsA= X-Gm-Gg: ASbGncsaDl9KHKx1JTwmU0GkztdeAW0fgPjzxZPkzcXwnMXF0XqR3Itc2OLzUznyAvA RD5Nmk5wbHO6Z/Elm/sm9E8x6utCNWHAFaG1GJ5ejkFjMDS3Nhrg= X-Google-Smtp-Source: AGHT+IGJa199HJ+bnBxR6PeExA7TEKME6Q0UvaR9VKGU3kyfDOJawqTYLZkCCtoOl5bAqpTq+KJnihysgkraYD0c1Qs= X-Received: by 2002:adf:9cc4:0:b0:385:f1f2:13cf with SMTP id ffacd0b85a97d-38a221f2036mr10417387f8f.5.1735684156543; Tue, 31 Dec 2024 14:29:16 -0800 (PST) In-Reply-To: <87msge8bv8.fsf@dancol.org> Received-SPF: pass client-ip=2a00:1450:4864:20::330; envelope-from=owinebar@gmail.com; helo=mail-wm1-x330.google.com X-Spam_score_int: -20 X-Spam_score: -2.1 X-Spam_bar: -- X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001, RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Xref: news.gmane.io gmane.emacs.devel:327537 Archived-At: On Sun, Dec 29, 2024 at 3:37=E2=80=AFPM Daniel Colascione wrote: > > Thanks. Such an approach would let us treat tree-sitter grammars a lot > more like font-lock-keywords, and I think for some modes, that'd be a > good option. (Of course, SHTDI.) The main blocking point for me is a primitive facility for describing machine-level binary data structures, and operations for manipulating data according to those specifications. The "bindat" facility is a step in that direction, but its semantics lacks pointers, which is a big limitation for simple translation of C data structures from source code. > > Tree sitter, as wonderful as it is, strikes me as a bit of a Rube > Goldberg machine architecturally: JS *and* Rust *and* C? Really? :-) They evidently decided to use JSON and a simple schema to specify the concrete grammar, instead of creating a DSL for the purpose. Javascript is just a convenient way for embedding code into JSON the same way LISP programmers use lisp to generate S-expressions. Once you have the JSON format generated, javascript is not used. The rest of the project is really composed of orthogonal components, the GLR grammar compiler (written in Rust) and the run-time GLR parsing engine, written in C. The grammar compiler produces the parsing tables in the form of C source code that is compiled together with the library for a single library per grammar, but the C library does not actually require the parsing tables to be statically known at compile-time, at least the last I looked, unless some really obscure dependence. The procedural interface to the parser just takes a pointer to the parser table data structure at run-time. Since GLR grammars are basically arbitrary (ambiguous) LR(1) grammars, the parser run-time has to implement a fairly sophisticated algorithm (graph-stacks) to be efficient. Having implemented the LALR parser generator at least 3 times in the last couple of decades (just for my own use), generating the parse tables looks like a lot simpler (and well-understood) problem to solve than the GLR run-time. More importantly, the efficiency of the grammar compiler is not all that critical compared to the run-time. > Do you happen to know whether the subset of Rust that gccrs recognizes > is sufficient to compile the tree sitter grammar compiler? If so, we > could in principle combine gccrs with a bare-bones embedded JS > interpreter like https://duckjs.org/ to produce a mechanism that would > let us customize and rebuild tree sitter grammars as easily as we do > elisp files, even on obscure platforms like DJGPP. I have no idea. As I wrote above, replicating the calculation performed by the grammar compiler is not that intimidating, if we had a way of writing out the parse tables in the in-memory structures understood by the runtime procedural interface. At least, replicating the GLR grammar analysis is a lot simpler than implementing a compiler for Rust, if that viewpoint makes sense. At worst, Emacs could move from the generated C files to consuming the JSON files. There's no requirement that parsers even have a JS form, that's just for convenience of grammar writers. I mean, look at https://github.com/tree-sitter/tree-sitter-cpp/blob/master/grammar.js . That uses a fairly limited subset of JS that has a straightforward translation to lisp types. If we don't require replicating every corner-case of Javascript, then the existing JS tree-sitter library could probably be used to produce an S-expression that a simple set of macros could translate into an S-expression equivalent of the corresponding grammar.json. If you had a emacs-based grammar compiler that could consume grammars in JSON format, with a generic tee-sitter dynamic library (no fixed parse tables), you could even "bootstrap" using the existing JSON from https://github.com/tree-sitter/tree-sitter-javascript/blob/master/src/gramm= ar.json, so Rust was never involved (if that is important). > > Some Emacs modes could ship with .js grammars sourced from upstream > editor-neutral projects. Other modes might just build tree sitter parse > tables in elisp using something vaguely like SMIE syntax. Both styles > of mode would be customizable by end users, and we'd (because, I'm a > broken record, vendor vendor vendor) we'd maintain compatibility without > mysterious AST-change-related breakages. I agree, a generic grammar capturing the structures of most programming languages would be useful. It is definitely possible to extract the syntactic/semantic concepts from C++ and Python to create such a grammar, if you are willing to allow nested grammars appropriately delimited. For example, a constructor context would delimit an expression in a data language that is embedded in a constructor context that may itself have delimited value contexts where the functional/procedural grammar may appear, ad infinitum. The procedural and data grammars are distinct but mutually recursive. That would be if the form appeared in an rvalue-context. For l-value expressions, the same constructor delimiting syntax can become a binding form, at least, with subexpressions of binding forms also being binding forms. As long as the scanner is dynamically set according to the grammar context (and recognizes/signals the closing delimiter), the grammar can be made non-ambiguous because a given character will produce context-appropriate terminal symbols. As for vendoring, I just doubt you will get much buy-in in this forum. There are corporate-type free/open-source software projects that prioritize uniformity in build environments and limiting the scope of bugs that can arise from the build process/dependencies that vendor at the drop of the hat. Then there are "classic" free software projects that have amalgamated the work of many individual contributors, and those contributors often prioritize control of the software running on their systems for whatever reason (but eliminating non-free software is definitely one of them), and they often can/will contribute patches for that purpose. The second camp *hates* vendoring because it subverts their control of their computational resources. At least, that's the dichotomy I see. There are probably finer points I'm missing or mischaracterizing. Lynn