From: Björn Bidar
Newsgroups: gmane.emacs.devel
Subject: Re: Tree-sitter maturity
Date: Wed, 01 Jan 2025 22:23:52 +0200
Message-ID: <8075.49878243066$1735763091@news.gmane.org>
In-Reply-To: (Lynn Winebarger's message of "Tue, 31 Dec 2024 17:29:04 -0500")
To: Lynn Winebarger
Cc: Daniel Colascione, Philip Kaludercic, emacs-devel, Eli Zaretskii, Richard Stallman, manphiz@gmail.com

Lynn Winebarger writes:

> On Sun, Dec 29, 2024 at 3:37 PM Daniel Colascione wrote:
>>
>> Thanks. Such an approach would let us treat tree-sitter grammars a lot
>> more like font-lock-keywords, and I think for some modes, that'd be a
>> good option. (Of course, SHTDI.)
>
> The main blocking point for me is a primitive facility for describing
> machine-level binary data structures, and operations for manipulating
> data according to those specifications. The "bindat" facility is a
> step in that direction, but its semantics lacks pointers, which is a
> big limitation for simple translation of C data structures from source
> code.
>
>>
>> Tree sitter, as wonderful as it is, strikes me as a bit of a Rube
>> Goldberg machine architecturally: JS *and* Rust *and* C? Really?
>> :-)
>
> They evidently decided to use JSON and a simple schema to specify the
> concrete grammar, instead of creating a DSL for the purpose.
> Javascript is just a convenient way for embedding code into JSON the
> same way LISP programmers use lisp to generate S-expressions. Once
> you have the JSON format generated, javascript is not used.
>
> The rest of the project is really composed of orthogonal components,
> the GLR grammar compiler (written in Rust) and the run-time GLR
> parsing engine, written in C. The grammar compiler produces the
> parsing tables in the form of C source code that is compiled together
> with the library for a single library per grammar, but the C library
> does not actually require the parsing tables to be statically known at
> compile-time, at least the last I looked, unless some really obscure
> dependence. The procedural interface to the parser just takes a
> pointer to the parser table data structure at run-time.
>
> Since GLR grammars are basically arbitrary (ambiguous) LR(1) grammars,
> the parser run-time has to implement a fairly sophisticated algorithm
> (graph-stacks) to be efficient. Having implemented the LALR parser
> generator at least 3 times in the last couple of decades (just for my
> own use), generating the parse tables looks like a lot simpler (and
> well-understood) problem to solve than the GLR run-time. More
> importantly, the efficiency of the grammar compiler is not all that
> critical compared to the run-time.
>

Alternatives to Node are already a good option. Using WASM as the
output format also does not sound bad, assuming there is some
abstraction on the tree-sitter library side.

>>
>> Some Emacs modes could ship with .js grammars sourced from upstream
>> editor-neutral projects. Other modes might just build tree sitter parse
>> tables in elisp using something vaguely like SMIE syntax.
>> Both styles of mode would be customizable by end users, and we'd
>> (because, I'm a broken record, vendor vendor vendor) we'd maintain
>> compatibility without mysterious AST-change-related breakages.
>
> I agree, a generic grammar capturing the structures of most
> programming languages would be useful. It is definitely possible to
> extract the syntactic/semantic concepts from C++ and Python to create
> such a grammar, if you are willing to allow nested grammars
> appropriately delimited. For example, a constructor context would
> delimit an expression in a data language that is embedded in a
> constructor context that may itself have delimited value contexts
> where the functional/procedural grammar may appear, ad infinitum. The
> procedural and data grammars are distinct but mutually recursive.
> That would be if the form appeared in an rvalue-context. For l-value
> expressions, the same constructor delimiting syntax can become a
> binding form, at least, with subexpressions of binding forms also
> being binding forms. As long as the scanner is dynamically set
> according to the grammar context (and recognizes/signals the closing
> delimiter), the grammar can be made non-ambiguous because a given
> character will produce context-appropriate terminal symbols.

What kind of scanner are you referring to? Something that works like a
binding generator, but for an AST?

> As for vendoring, I just doubt you will get much buy-in in this forum.
> There are corporate-type free/open-source software projects that
> prioritize uniformity in build environments and limiting the scope of
> bugs that can arise from the build process/dependencies that vendor at
> the drop of the hat.
> Then there are "classic" free software projects
> that have amalgamated the work of many individual contributors, and
> those contributors often prioritize control of the software running on
> their systems for whatever reason (but eliminating non-free software
> is definitely one of them), and they often can/will contribute patches
> for that purpose. The second camp *hates* vendoring because it
> subverts their control of their computational resources. At least,
> that's the dichotomy I see. There are probably finer points I'm
> missing or mischaracterizing.

From my point of view as a distribution packager, there are several
reasons why vendoring can be bad, although in some contexts keeping
vendored copies is the better decision. In this context, though, it
complicates the build process, because each grammar now has to be
built for Emacs in addition to other editors. The Emacs package then
pulls in more build dependencies at build time, which complicates the
build process further as the dependency list grows. Besides, bundled
dependencies are not allowed unless there is no way to avoid them.

It is not about control or anything. There are no political reasons on
my side besides a preference for free software with copyleft and for
an end to GPL-3.0 avoidance. The latter is not exactly relevant here,
but non-free software was mentioned, which makes the GPL-3.0 avoidance
issue relevant: it gave more power to non-free software by giving
licenses without copyleft a bigger audience.