From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail
From: Eric Ludlam <ericludlam@gmail.com>
Newsgroups: gmane.emacs.devel
Subject: Re: Why tree-sitter instead of Semantic? (was Re: CC Mode with
 font-lock-maximum-decoration 2)
Date: Sat, 20 Aug 2022 09:15:36 -0400
Message-ID: <d452bbb6-b04e-8bfe-7936-e4a3d149c28d@gmail.com>
References: <YvFYvQt+xVJHmona@ACM> <83o7wuva9o.fsf@gnu.org>
 <YvFffq5Z/jTaAMra@ACM> <83mtceupbx.fsf@gnu.org> <YvIUEEAi00E5Ooa6@ACM>
 <83lerxvfnu.fsf@gnu.org> <YvJD5JMvJd+BIqEt@ACM> <838rnxvdcq.fsf@gnu.org>
 <YvKM9NF2e9v9yUmO@ACM> <83r11ptksn.fsf@gnu.org> <YvKcw0IurgWRsL9G@ACM>
 <83a68dti6w.fsf@gnu.org>
 <CAM=F=bB4GjNCaGGDCAGT2q98bky-AhruinemoMkMf3COeT3KjQ@mail.gmail.com>
 <c706a0f2-1e97-c9af-2fca-17d74dea3518@secure.kjonigsen.net>
 <87a687sjnv.fsf@yahoo.com>
 <CAM=F=bDSYwzpukCgwVcSMqb_5ejQkb78+dh=Dur88H47tGo2SQ@mail.gmail.com>
 <83zgg4fm9p.fsf@gnu.org>
 <CAM=F=bC-CU5qa6D+4PeDHf+axKewkd11OwAKtEUP1QBU-2wUoQ@mail.gmail.com>
 <jwvmtc4f6tl.fsf-monnier+emacs@gnu.org>
 <CAM=F=bCHrXTeT6XrVNk=vO+wbB4aD4rFLb_C-wLwKk3yqO1JKQ@mail.gmail.com>
 <e6d58072-6eb0-9557-209e-02d0d0c02a94@siege-engine.com>
 <CAM=F=bB6sWUXvwDi5iigBN-n5JKZdyhUWO8TXU5FEPHOROWMHw@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214";
	logging-data="22596"; mail-complaints-to="usenet@ciao.gmane.io"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
 Thunderbird/91.11.0
Cc: Stefan Monnier <monnier@iro.umontreal.ca>, Eli Zaretskii <eliz@gnu.org>,
 luangruo@yahoo.com, jostein@secure.kjonigsen.net, jostein@kjonigsen.net,
 acm@muc.de, emacs-devel@gnu.org, casouri@gmail.com
To: Lynn Winebarger <owinebar@gmail.com>
Content-Language: en-US
In-Reply-To: <CAM=F=bB6sWUXvwDi5iigBN-n5JKZdyhUWO8TXU5FEPHOROWMHw@mail.gmail.com>
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
 <mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <https://lists.gnu.org/archive/html/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
 <mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org
Original-Sender: "Emacs-devel"
 <emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org>
Xref: news.gmane.io gmane.emacs.devel:293665
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/293665>

On 8/18/22 8:34 AM, Lynn Winebarger wrote:
> On Tue, Aug 16, 2022 at 9:41 PM Eric Ludlam <ericludlam@gmail.com> wrote:
>> On 8/16/22 1:40 PM, Lynn Winebarger wrote:
>>> On Tue, Aug 16, 2022 at 1:19 PM Stefan Monnier <monnier@iro.umontreal.ca> wrote:
>>>>> I'm only saying there's a disconnect between Jostein's report and Po's
>>>>> response.  It's probably a UI issue.  There's a checkbox in a dropdown
>>>>> menu that says "Source Code Parsers (Semantic)".
>>>> FWIW, I've used (semantic-mode 1) to enable CEDET in Emacs's C source
>>>> files and that was all that was needed to get TAB completion of struct
>>>> field's names working.
>>>> I haven't used it for much more than that, admittedly.
>>> It also works for me, but I also have been mostly looking at Emacs
>>> source with it, and Semantic knows how to use the TAGS file for
>>> context-sensitive completion in C.  And something is working
>>> gangbusters in Elisp, but unfortunately I can't really identify which
>>> package is doing the work.
>>>
>>>>> *  "${" and "{" could both open a block closed by "}"
>>>> Why do you think it's a problem?
>>> If you want the lexer to tokenize the ${ as a symbol while still
>>> recognizing the text in between as delimited, it seems like a problem.
>>>     I mean, I already deal with that in ordinary font-lock, I was hoping
>>> the parser/lexer generation would address the issue independently of
>>> syntax tables.
>> Lexers are built per-language from a set of analyzers.  Thus, you call
>> (define-lex ...) and list a bunch of analyzers, which are created with
>> `define-lex-analyzer' or one of the variants.
>>
>> The analyzers mostly use regular expressions, and when possible, uses
>> expressions that use the syntax table because they are quite fast.  If
>> you restrict yourself to the built-in named lexer analyzers, like
>> 'semantic-lex-whitespace', then that is what they are, but you can use
>> `define-lex-analyzer' or `define-lex-regex-analyzer' and write any code
>> you want to do a match, push a token, and find the end point.  The C
>> lexer/parser does this a lot.
>>
>> For a very simple case like matching ${:
>> (define-lex-simple-regex-analyzer my-dollar-curly
>>    "doc string"
>>    "\\${" 'dollar-curly)
>>
>> and then put this in front of the { } block analyzer when you build up
>> your lexer.
> Thanks for the details.  I'm not sure what you mean by "put this in
> front of the ... block analyzer" though.  I just don't understand how
> the different token types interact with each other and/or the "block"
> (or other) construct well enough to confidently use the built-in
> types.
> What I will take away here is that I can closely review the C
> lexer/parser to see how someone who does understand the interaction of
> those types uses them effectively, before investing a lot of time
> studying the construction of the built-in types for the purpose of
> extending them.  Which I'm not sure I would do for the problem I'm
> currently dealing with in any case.
> Am I right that the "block" classification is used to allow Semantic
> to localize the impact of unparseable text?  It sounds like the system
> will still function without explicitly declaring block constructs, but
> some useful features might be effectively disabled.
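Building a lexer is done in two steps.  First, you build analyzers for
specific matches, such as the example above.  Then, once you have a set
of analyzers for specific syntaxes, you assemble them into a lexer.  The
lexer tries the analyzers in the order you list them, and the first one
that matches wins, so "in front of" just means listing my-dollar-curly
before the paren analyzers, like this: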

(define-lex my-lexer
   "Doc string"
   semantic-lex-ignore-whitespace
   ;; Custom analyzers that would otherwise conflict with blocks
   my-dollar-curly
   ;; Blocks: paren-delimited text
   semantic-lex-paren-or-list
   semantic-lex-close-paren
   ;; Other analyzers
   semantic-lex-number
   ;; Always end with the default action
   semantic-lex-default-action)

Hopefully this explains the basics of building analyzers and assembling
them into your lexer.

If you are building a lexer just to do some tokenizing, then this is
about all you need; the documentation has more details.
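
As a rough sketch of the tokenizing-only case, here is one way to run
such a lexer over a buffer.  It repeats the definitions so it stands on
its own, and assumes semantic/lex is available.  The helper
my-lex-string and the sample text "${ 1 }" are made up for illustration.
Note that the regexp is spelled "\\${" here: in Emacs regexps a bare {
is literal, while \{ opens a repetition interval.

```elisp
(require 'semantic/lex)

;; Analyzer for the two-character sequence "${".
(define-lex-simple-regex-analyzer my-dollar-curly
  "Detect the ${ opener and push a `dollar-curly' token."
  "\\${" 'dollar-curly)

(define-lex my-lexer
  "Toy lexer: whitespace, ${, parens/blocks and numbers."
  semantic-lex-ignore-whitespace
  my-dollar-curly
  semantic-lex-paren-or-list
  semantic-lex-close-paren
  semantic-lex-number
  semantic-lex-default-action)

(defun my-lex-string (text)
  "Return the token stream `my-lexer' produces for TEXT.
Hypothetical helper, for experimenting only."
  (with-temp-buffer
    (insert text)
    ;; Set up the buffer-local syntax table the lexer runs under.
    (semantic-lex-init)
    (my-lexer (point-min) (point-max))))

;; Each token has the form (CLASS START . END); look at the classes.
(mapcar #'semantic-lex-token-class (my-lex-string "${ 1 }"))
```

The "${" should come back as a single dollar-curly token rather than
as a "$" followed by an open brace.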

If you want to build a parser that sits on top of the lexer, there is
more to it.  I recommend using the wisent parser generator, as it
creates faster parsers.  In the wisent .wy files, you declare %token
entries using a bison-like syntax, and that in turn builds analyzers
that you include in your lexer.  The Java parser and lexer cover a lot
of cases, though the calc parser is smaller and easier to grok.

The purpose of 'block' constructs in the lexer is just to cut out large
chunks of text that you then don't have to write parser rules for.  My
goal was creating tags, and for that, parsing the body of a function,
for example, is not needed.  Using the lexer to skip all of that speeds
things up.  If you want to parse the ENTIRE file, just don't put blocks
in your lexer, and only put in the open/close paren analyzers.
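
To make the difference concrete, here is a small sketch comparing a
lexer that uses the block analyzer with one that only has the
open/close paren analyzers.  The two lexer names, the helper
my-token-classes and the sample text "(1 2) 3" are made up for
illustration; the analyzers themselves come from semantic/lex.

```elisp
(require 'semantic/lex)

;; Block-style lexer: at the maximum depth, "( ... )" becomes one
;; `semantic-list' token and its contents are never tokenized.
(define-lex my-block-lexer
  "Toy lexer that collapses parenthesized text into blocks."
  semantic-lex-ignore-whitespace
  semantic-lex-paren-or-list
  semantic-lex-close-paren
  semantic-lex-number
  semantic-lex-default-action)

;; Flat lexer: only open/close paren analyzers, so everything
;; between the parens is tokenized as well.
(define-lex my-flat-lexer
  "Toy lexer that tokenizes the whole text, parens included."
  semantic-lex-ignore-whitespace
  semantic-lex-open-paren
  semantic-lex-close-paren
  semantic-lex-number
  semantic-lex-default-action)

(defun my-token-classes (lexer text)
  "Run LEXER over TEXT and return the list of token classes.
Hypothetical helper, for experimenting only."
  (with-temp-buffer
    (insert text)
    (semantic-lex-init)
    ;; Depth 0: do not descend into any paren-delimited block.
    (mapcar #'semantic-lex-token-class
            (funcall lexer (point-min) (point-max) 0))))

;; Block lexer: the parenthesized text is a single token.
(my-token-classes #'my-block-lexer "(1 2) 3")
;; Flat lexer: parens and their contents are all separate tokens.
(my-token-classes #'my-flat-lexer "(1 2) 3")
```

With the block lexer, a parser only ever sees one token for the whole
parenthesized chunk, which is where the speedup for tagging comes from.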

Hope this helps.
Eric