From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED.blaine.gmane.org!not-for-mail From: Dmitry Gutov Newsgroups: gmane.emacs.devel Subject: Re: Using incremental parsing in Emacs Date: Fri, 10 Jan 2020 00:56:38 +0300 Message-ID: References: <83blrkj1o1.fsf@gnu.org> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Injection-Info: blaine.gmane.org; posting-host="blaine.gmane.org:195.159.176.226"; logging-data="212889"; mail-complaints-to="usenet@blaine.gmane.org" User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.9.0 Cc: "emacs-devel@gnu.org" To: arthur miller , Eli Zaretskii Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Thu Jan 09 22:58:20 2020 Return-path: Envelope-to: ged-emacs-devel@m.gmane-mx.org Original-Received: from lists.gnu.org ([209.51.188.17]) by blaine.gmane.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.89) (envelope-from ) id 1ipfny-000CjJ-2U for ged-emacs-devel@m.gmane-mx.org; Thu, 09 Jan 2020 22:57:26 +0100 Original-Received: from localhost ([::1]:37662 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1ipfnw-0002fI-J6 for ged-emacs-devel@m.gmane-mx.org; Thu, 09 Jan 2020 16:57:24 -0500 Original-Received: from eggs.gnu.org ([2001:470:142:3::10]:40896) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1ipfnM-0002FT-Vu for emacs-devel@gnu.org; Thu, 09 Jan 2020 16:56:50 -0500 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1ipfnL-0001OB-JG for emacs-devel@gnu.org; Thu, 09 Jan 2020 16:56:48 -0500 Original-Received: from mail-lj1-x242.google.com ([2a00:1450:4864:20::242]:45218) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1ipfnL-0001Ks-9h; Thu, 09 Jan 2020 16:56:47 -0500 Original-Received: by mail-lj1-x242.google.com with SMTP id j26so8835706ljc.12; Thu, 09 Jan 2020 13:56:45 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=sender:subject:to:cc:references:from:message-id:date:user-agent :mime-version:in-reply-to:content-language:content-transfer-encoding; bh=ABxnKquU5RbbLE3E5yBwZwngVz85d93nxbCIB+lCPCI=; b=MxlVDbg0SqItWmNi/UkusX4MaeS1+KbzKtr2TdKfNr2oW4JcyfoY8gJxqefnnYJ5Pg bq6pBtFEVGtgOoNuQVSqGfIsfdfnvX8cL3oUAiJpxqxrubCxOS0H48KBHUsNHMAzmhNw SALPnR85RJ+MXG/4lvNgtt2fDmRp6KiwlHLQjYYoO2zwUdTZ5Atflnux2xHf6WJUlPlE hSd1U4ck6ZMe24Nr0NZ22DbhBod0rufdqY0wk97eHofh3yN4lYw3Ip4wGZvidNw73C4b uGblr6IZRNg+oQkp5wjrKAM6TjjZKYmBCGiekQcRFSP0QHhgg+Y7eEQ81qnxXaKp8bNW SoPQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:sender:subject:to:cc:references:from:message-id :date:user-agent:mime-version:in-reply-to:content-language :content-transfer-encoding; bh=ABxnKquU5RbbLE3E5yBwZwngVz85d93nxbCIB+lCPCI=; b=snRp2lygFFuE8YUOkRI09Zb5tIiFyIm96kldRM2imNSYVn2Ch9T6Ixna561MCiPHb/ ovdSU456SG8TkPUlU++56aR34+D6Pwlp6aUphBBW+D+MiO0SxVETHDy2Ylm1uaYeyAvg OZNGxkPAhD3CCI3TXOwentBGnfVrC65jDbtcYKjB01qPn3W9gIg/oHZ96p00/YEwi+qL FQuwPxAfdO87K2c97HSCFZawISCc7deHtfVxrPlnBBYmOyWIB/qfJDwkdEgXFJjjgzy9 /Mrije3VYLtMD7hE5/da0K2KPNb9PPViu42GdeIaTl1PKHDSEi+VvY8P8qvSdYCP4Ams ES6A== X-Gm-Message-State: APjAAAXpkHegeUcb+KMeLcolsGHJwrEFwdLwwvNcOBYl5Nfyarfx0+CG Zf+Bba+0aDrwgtEqzGs+RSB+MKhDTFU= X-Google-Smtp-Source: APXvYqxSgdX0jy57PcLYqTXubvft0MbgD4NdsGUWLuJuuFEgHdjbM+7YkMm3qP329otlOzmTzRKvPA== X-Received: by 2002:a2e:868c:: with SMTP id l12mr118423lji.194.1578607003052; Thu, 09 Jan 2020 13:56:43 -0800 (PST) Original-Received: from [192.168.1.142] ([178.252.127.239]) by smtp.googlemail.com with ESMTPSA id 21sm3626607ljv.19.2020.01.09.13.56.39 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 09 Jan 2020 13:56:41 -0800 (PST) In-Reply-To: Content-Language: en-US X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized. X-Received-From: 2a00:1450:4864:20::242 X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Original-Sender: "Emacs-devel" Xref: news.gmane.org gmane.emacs.devel:244165 Archived-At: On 04.01.2020 16:46, arthur miller wrote: > There is a very good presentation of tree-sitter on YT by its author: > > https://www.youtube.com/watch?v=Jes3bD6P0To > > Looks much better then what I got a picture by just reading on the > website: It was a good watch. Some takeaways from me: It implements a GLR parser. One that can update the existing AST quickly for an arbitrary edit in the middle of a file. (*) But it parses a new file quickly as well: a 20000 lines JS file in 54ms. To be able to reach that speed, they went the traditional compiler-writer route of having a separate (grammar-to-C-code) compilation step from a grammar to a parser program (which relies on a shared runtime). (**) Some of it seems to be by necessity. Every run returns a full AST, not just an "AST up to this position". I suppose the author didn't want the problems that come with unfinished parse trees when code relies on that returned value. (***) The generated parser, in addition to being incremental, is error-tolerant, which is a necessity for use in editors. As a result, they have features like fast semantic syntax highlighting, as well code folding that accurately detects where function body begins and ends (previously, Atom and other editors used guessing based on indentation levels, apparently). And a "extend selection" command based on AST as well (****) Tree-Sitter is also used inside GitHub for various features, including their Semantic library (which implements code navigation on the web). In the meantime, our current answer to all of the above is syntax-ppss plus local regexp-based parsing around the visible part of the buffer. To compare: (*) syntax-ppss is also fully incremental, although the returned value is a very simplistic substitute for an AST. But we've been using it for a while and have done solid things with it. (**) Which means that if we try to use Tree-Sitter as-is, our current practice of defining the language grammar in Lisp would go our of the window. https://github.com/ubolonton/emacs-tree-sitter demonstrates this as well: language grammars have to be compiled into a shared library (or libraries). We would have lots of grammars supplied by the third party, which is kind of good, but we would lose the ease of experimenting with them that we have now, or being able to write support for a new up-and-coming language very quickly. Which a certain fraction of our users enjoys, AFAIK. (***) Whereas syntax-ppss stops at a requested position, thus saving on CPU cycles this way. Similarly, if a new system we'll transition to someday also does this, its absolute performance/throughput would be less important if it only usually has to parse a screen-worth of file at a time. (****) We've been managing surprisingly well with syntax-ppss, forward-sexp, etc. So code folding works quite well in Emacs already, and the easy-kill package in GNU ELPA does the "expand selection" thing very successfully as well. But we could use some improvement in having some more complex syntax supported or handled more easily, in certain languages. Having a "proper AST" available is nothing to sneeze at either, and would likely help a lot in indentation code. My personal takeaway is that we could really benefit from a lispier version of this technology, and Someone(tm) should start working on that.