From: Yuan Fu
Newsgroups: gmane.emacs.devel
Subject: Re: Update on tree-sitter structure navigation
Date: Sat, 2 Sep 2023 15:09:08 -0700
References: <5E7F2A94-4377-45C0-8541-7F59F3B54BA1@gmail.com> <87h6odhxs6.fsf@localhost>
In-Reply-To: <87h6odhxs6.fsf@localhost>
To: Ihor Radchenko
Cc: emacs-devel, Danny Freeman, Theodor Thornhill, Jostein Kjønigsen, Randy Taylor, Wilhelm Kirschbaum, Perry Smith, Dmitry Gutov

> On Sep 1, 2023, at 11:52 PM, Ihor Radchenko wrote:
>
> Yuan Fu writes:
>
>> In the months after wrapping up tree-sitter stuff in emacs-29, I was
>> thinking about how to implement structural navigation and extracting
>> information from the parser with tree-sitter. In emacs-29 we have
>> things like treesit-beginning/end-of-defun, and treesit-defun-name. I
>> was thinking maybe we can generalize this to support getting arbitrary
>> “thing” at point, move around them, and getting information like the
>> name of a defun, its arglist, parent of a class, type of an variable
>> declaration, etc, in a language-agnostic way.
>
> Note that Org mode also does all of these using
> https://orgmode.org/worg/dev/org-element-api.html
>
> It would be nice if we could converge to more consistent interface
> across all the modes. For example, by extending `thing-at-point' to handle
> parsed elements, not just simplistic regexp-based "thing" boundaries
> exposed by `thing-at-point' now.
>
> Org approaches getting name/begin/end/arguments using a common API:
>
>   (org-element-property :begin NODE)
>   (org-element-property :end NODE)
>   (org-element-property :contents-begin NODE)
>   (org-element-property :contents-end NODE)
>   (org-element-property :name NODE)
>   (org-element-property :args NODE)
>
> Language-agnostic "thing"s will certainly be welcome, especially given
> that tree-sitter grammars use inconsistent naming schemes, which have to
> be learned separately, and may even change with grammar versions.
>
> I think that both NODE types and attributes can be standardized.

If we come up with a thing-at-point interface that provides more information than the current (BEG . END), tree-sitter can surely support it as a backend. We just need someone to come up with it :-) But I don’t see how this interface can support semantic information like the arglist of a defun, or the type of a declaration; these things are not universal to all “nodes”.

>
>> Also, at the time, we only support defining things by a regexp
>> matching a node’s type, which is often not enough.
>>
>> And it would be nice to somehow take advantage of the tree-sitter
>> queries for the features I mentioned above. Tree-sitter query is what
>> every other editor are using for virtually all tree-sitter related
>> features. But in Emacs, we mostly only use it for font-lock.
>
> I recall one user asking about something like VIM's textobjects via
> tree-sitter queries. Example:
> https://github.com/nvim-treesitter/nvim-treesitter-textobjects/blob/master/queries/cpp/textobjects.scm

I think that’s something that can be implemented with thing definitions.

>> Here’s the progress as of now:
>>
>> - Functions like treesit-search-forward, treesit-induce-sparse-tree,
>> treesit-thing-at-point, treesit--navigate-thing, etc, support a richer
>> set of predicates now. Besides regexp matching the type, the predicate
>> can also be a predicate function, or (REGEXP . FUNC), or compound
>> predicates like (or PRED PRED) or (not PRED).
>
> Slightly unrelated, but do you have any idea if it can be faster to use
> Emacs' regexp search combined with treesit-thing-at-point vs. pure
> tree-sitter query?

Not really.

>
>> - There’s now a variable treesit-thing-settings, which holds
>> definition for things. Then, instead of passing the predicate to the
>> functions I mentioned above, you can save the predicate in
>> treesit-thing-settings under a symbol, say ‘sexp’, and pass the symbol
>> instead, just like thing-at-point.el. (We’ll work on integrating with
>> thing-at-point.el later.)
>
> This sounds similar to textobjects I linked above.
> One question: how will it integrate with multiple parsers in one buffer?

This is only concerned with checking whether a node satisfies the definition of a “thing”, and doesn’t care how you get the node. Retrieving a node through either treesit-node-at or other functions already works with multiple parsers.

Also, the “thing” definition is language-specific.

>
>> - I can’t think of a good way to integrate tree-sitter queries with
>> the navigation functions we have right now. Most importantly,
>> tree-sitter query always search top-down, and you can’t limit the
>> depth it searches. OTOH, our navigation functions work by traversing
>> the tree node-to-node.
>
> May you elaborate about the difficulties you encountered?

Ideally I’d like to pass a query and a node to treesit-node-match-p, which would return t if the query matches the node. But queries don’t work like that. They search the node and return all the matches within that node, which is potentially wasteful.

>
>> Some other things on the TODO list that people can take a jab at:
>>
>> - Solve the grammar versioning/breaking-change problem: tree-sitter grammar don’t have a version number, so every time the author changes the grammar, our queries break, and loading the mode only produces a giant error.
>
> May we somehow get a hash of the library? That way, we can at least
> detect if something has changed.

All we get is a binary dynamic library, so I don’t think so.

>
>> - Major mode fallback/inheritance, this has been discussed many times, no good solution emerged.
>
> I think that integration of tree-sitter with navigation functions might
> be a step towards solving this problem. If common Emacs commands can
> automatically choose between tree-sitter and classic implementations, it
> might become easier to unify foo-ts-mode with foo-mode.

Unifying tree-sitter and non-tree-sitter modes creates many problems. I’d rather look for some way to share configuration between the two modes. We’ve had many discussions about this before, with no fruitful conclusion.

>
>> - Isolated ranges. For many embedded languages, each blocks should be independent from another, but currently all the embedded blocks are connected together and parsed by a single parser. We probably need to spawn a parser for each block. I’ll probably work on this one next.
>
> Do you mean that a single parser sees subsequent block as a continuation
> of the previous?

Exactly.

Yuan
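
A minimal sketch of the thing definitions and richer predicates discussed in this message. It assumes the interface as it appears in Emacs 30 (treesit-thing-settings, treesit-node-match-p, treesit-thing-at-point); the signatures on the development branch at the time of writing may have differed. The node type names are assumptions about the C grammar and can change between grammar versions, and my/node-multiline-p and my/c-ts-setup-things are hypothetical helpers, not part of Emacs.

```elisp
;;; -*- lexical-binding: t; -*-
(require 'treesit)

(defun my/node-multiline-p (node)
  "Return non-nil if NODE spans more than one line (hypothetical helper)."
  (> (count-lines (treesit-node-start node) (treesit-node-end node)) 1))

(defun my/c-ts-setup-things ()
  "Install example thing definitions for C in the current buffer."
  (setq-local treesit-thing-settings
              '((c
                 ;; A regexp matched against the node's type.
                 (defun "function_definition")
                 ;; (REGEXP . FUNC): the type must match and FUNC must
                 ;; return non-nil for the node.
                 (big-block ("compound_statement" . my/node-multiline-p))
                 ;; Compound predicates: any alternative / negation.
                 (sexp (or "statement" "expression"))
                 (code (not "comment"))))))

;; Usage, in a buffer that already has a C parser:
;;   (my/c-ts-setup-things)
;;   (treesit-node-match-p (treesit-node-at (point)) 'code)
;;   (treesit-thing-at-point 'defun 'nested)
```

Because each definition is stored under the language symbol, the same thing name (say, sexp) can carry a different predicate for every language parsed in the buffer.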