Re: Tokenizing - Daniel Colascione

all messages for Emacs-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed

From: Daniel Colascione <dancol@dancol.org>
To: Stephen Leake <stephen_leake@stephe-leake.org>, emacs-devel@gnu.org
Subject: Re: Tokenizing
Date: Mon, 22 Sep 2014 07:14:30 -0700	[thread overview]
Message-ID: <54202EC6.5040304@dancol.org> (raw)
In-Reply-To: <85vbofkb2j.fsf@stephe-leake.org>

[-- Attachment #1: Type: text/plain, Size: 4467 bytes --]

On 09/22/2014 07:02 AM, Stephen Leake wrote:
> Daniel Colascione <dancol@dancol.org> writes:
> 
>> Wagner's thesis contains a description of a few alternative incremental
>> GLR algorithms that look very promising.
> 
> Thanks for the pointer; I really need better error handling in the Ada
> mode parser. It currently defaults to a _really_ dumb indentation
> algorithm (same as previous line) when the parse fails.

Yep. falling back to counting parens seems reasonable enough.

>> We can start parsing at some region we quickly determine is semantically
>> equivalent to the start of the buffer using something like
>> font-lock-beginning-of-syntax-function.  
> 
> For Ada, that's only start of buffer; the standard convention is only
> one compilation-unit (the single grammar start symbol) per file.
> 
>> Here, we can reliably partially parse the buffer by parsing
>> forward (remember, GLR parsing is an online algorithm) until the
>> parsing limit is after the visible portion of the buffer; we can then
>> use a "shortest-to-completion" production (which we can pre-compute
>> from the grammar) to insert virtual tokens that complete the language.
> 
> I had the same idea; I was hoping it was new and I could make money :).
> Is that in Wagner's thesis (I didn't see it on a quick scan), or written
> up somewhere else?

Not AFAIK.

>> One really nice
>> property of GLR parsing (and LR parsers generally) is that we can
>> trivially pause and restart them, so we can parse in the background with
>> very little impact on user activity, using `while-no-input' to back out
>> to the event loop when we detect we have something more important to
>> do.
> 
> That's what I need! Although a really fast GLR parser implemented in
> compiled Ada that accesses the buffer via an Emacs FFI would be better.

Another nice thing about lexing with an automaton and parsing with a
generic algorithm is that the parsing machine can be pushed (if needed)
to C core with no loss of customizability or generality. I'd prefer that
individual modes not rely on the FFI for basic fontification and
indentation services.

>> Error recovery is a challenge --- we want to do something useful on
>> buffers that are temporarily invalided by user edits.  One of the
>> biggest challenges is dealing with mismatched brace constructs, since
>> one missing "}" in C-mode can cause the entire rest of the buffer to no
>> longer conform to the grammar.
> 
> I'm hoping to use the same "shortest-to-completion" algorithm for this.
> It may distort the rest of the buffer, but it will be better than what
> it does now.

Maybe it's better than nothing, but bridge parsing is a bit more
promising, I think. :-)

>> Bridge parsing is particularly interesting here: the basic idea is that
>> we perform one pass over the buffer using a simplified grammar that
>> looks only at nesting constructs, insert virtual braces to make
>> everything balance, and then apply the usual parsing algorithm (with
>> ad-hoc error recovery) to the result. I haven't tried it, but it
>> apparently works pretty well.
> 
> That could work for Ada as well.
> 
>> I like what SMIE does, but I don't think it's powerful enough to parse
>> C++. 
> 
> I wrote a SMIE parser for Ada, but it ended up always parsing the whole
> buffer. So I switched to LALR. Then to generalized so I would not have
> to mangle the grammar in the Ada standard as much.
> 
>> GLR parsers also leave us with a parse forest instead of a parse tree,
>> and we really need a parse *tree* if we're going to apply fontification
>> rules specified in terms of AST nodes.  We need to prune the parse
>> forest down to a parse tree before using it for fontification.
> 
> In Ada mode, I just assume the results of parallel parses that end in
> the same state are equivalent for the purposes of Ada mode
> (fontification and navigation). So I just throw away all but one result.
> I haven't really investigated whether this is true.

You can still end up in the same state having shifted different
subtrees, so if you do it that way, you can flop between different
parsing alternatives as you edit the buffer for no reason at all.

>> Consider the most vexing parse:
>>
>>   type variable(function())
> 
> One reason I like Ada; the equivalent declarations are not ambiguous.

Writing unambiguous languages seems to have fallen out of fashion.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

next prev parent reply	other threads:[~2014-09-22 14:14 UTC|newest]

Thread overview: 72+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-09-19 14:59 Overlay mechanic improvements Vladimir Kazanov
2014-09-19 17:22 ` Stefan Monnier
2014-09-20 13:19   ` Richard Stallman
2014-09-20 13:37     ` David Kastrup
2014-09-21 13:35       ` Richard Stallman
2014-09-21 13:52         ` David Kastrup
2014-09-21 21:48           ` Richard Stallman
2014-09-21 22:06             ` David Kastrup
2014-09-22 23:11               ` Richard Stallman
2014-09-22 23:50                 ` David Kastrup
2014-09-23 19:15                   ` Richard Stallman
2014-09-21 16:07         ` Stefan Monnier
2014-09-21 16:14           ` David Kastrup
2014-09-21 21:48             ` Richard Stallman
2014-09-21 22:19               ` David Kastrup
2014-09-23 19:16                 ` Richard Stallman
2014-09-23 19:27                   ` David Kastrup
2014-09-28 23:24                     ` Richard Stallman
2014-09-29  5:45                       ` David Kastrup
2014-09-29 20:48                         ` Richard Stallman
2014-09-30  1:21                           ` Stephen J. Turnbull
2014-09-30  8:43                             ` David Kastrup
2014-09-30 10:35                               ` Rasmus
2014-09-30 14:22                                 ` Eli Zaretskii
2014-09-30 16:20                                   ` David Kastrup
2014-09-30 16:35                                     ` Eli Zaretskii
2014-09-30 14:32                                 ` Stefan Monnier
2014-10-02 16:12                                 ` Uwe Brauer
2014-09-30 19:23                             ` Richard Stallman
2014-10-01  3:38                               ` Stephen J. Turnbull
2014-10-01 12:53                                 ` Richard Stallman
2014-10-01 13:11                                   ` David Kastrup
2014-10-02  1:26                                   ` Stephen J. Turnbull
2014-09-30  5:52                           ` David Kastrup
2014-10-06 19:14                             ` Richard Stallman
2014-10-06 21:02                               ` David Kastrup
2014-09-21 16:56           ` Eli Zaretskii
2014-09-21 18:42             ` Stefan Monnier
2014-09-21 18:58               ` Eli Zaretskii
2014-09-21 20:12                 ` Stefan Monnier
2014-09-21 21:48           ` Richard Stallman
2014-09-22  0:31             ` Stefan Monnier
2014-09-22 23:11               ` Richard Stallman
2014-09-20 15:56     ` Eli Zaretskii
2014-09-20 19:49     ` Stefan Monnier
2014-09-21 13:36       ` Richard Stallman
2014-09-19 18:03 ` Richard Stallman
2014-09-20  8:08   ` Vladimir Kazanov
2014-09-20 13:21     ` Richard Stallman
2014-09-20 16:28       ` Stephen Leake
2014-09-20 13:21     ` Tokenizing Richard Stallman
2014-09-20 16:24       ` Tokenizing Stephen Leake
2014-09-20 16:40         ` Tokenizing Vladimir Kazanov
2014-09-20 20:16           ` Tokenizing Eric Ludlam
2014-09-20 20:35             ` Tokenizing Vladimir Kazanov
2014-09-21 15:13             ` parsing (was tokenizing) Stephen Leake
2014-09-20 16:36       ` Tokenizing Vladimir Kazanov
2014-09-20 19:55         ` Tokenizing Stefan Monnier
2014-09-21 15:35           ` Tokenizing Stephen Leake
2014-09-21 16:43             ` Tokenizing Stefan Monnier
2014-09-22 14:05               ` Tokenizing Stephen Leake
2014-09-21 13:35         ` Tokenizing Richard Stallman
2014-09-21 14:24           ` Tokenizing Vladimir Kazanov
2014-09-21 15:32         ` Tokenizing Stephen Leake
2014-09-21 16:42           ` Tokenizing Stefan Monnier
2014-09-21 18:55           ` Tokenizing Vladimir Kazanov
2014-09-21 22:01             ` Tokenizing Daniel Colascione
2014-09-22 10:21               ` Tokenizing Vladimir Kazanov
2014-09-22 13:55                 ` Tokenizing Daniel Colascione
2014-09-22 14:02               ` Tokenizing Stephen Leake
2014-09-22 14:14                 ` Daniel Colascione [this message]
2014-09-22 13:15             ` Tokenizing Stephen Leake

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=54202EC6.5040304@dancol.org \
    --to=dancol@dancol.org \
    --cc=emacs-devel@gnu.org \
    --cc=stephen_leake@stephe-leake.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/emacs.git
	https://git.savannah.gnu.org/cgit/emacs/org-mode.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.