From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Stephen Leake <stephen_leake@stephe-leake.org>
Newsgroups: gmane.emacs.devel
Subject: Re: Tokenizing
Date: Mon, 22 Sep 2014 09:02:28 -0500
Message-ID: <85vbofkb2j.fsf@stephe-leake.org>
References: <CAAs=0-1-1KCMOvWm7+kt8Ddiu1BZznoUpB2SVFAM_fV3zNVE8A@mail.gmail.com>
	<E1XV2XA-0002kq-7t@fencepost.gnu.org>
	<CAAs=0-0Jr7kiJuqmG_sxrO6VAMf7NEoJ-FAed-ZNE0H5srBJ8A@mail.gmail.com>
	<E1XVKb1-0006SZ-Un@fencepost.gnu.org>
	<CAAs=0-05L9nV69HPWTfEgtEq_x8LiGivmCo7vOJEr5dDJMVCsg@mail.gmail.com>
	<85ha01dm5u.fsf@stephe-leake.org>
	<CAAs=0-0jNfgkBVoQaV+GWBRCu=-yLteJU83+4EbVin4a5fqTGg@mail.gmail.com>
	<541F4AD0.3060109@dancol.org>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain
X-Trace: ger.gmane.org 1411394582 5124 80.91.229.3 (22 Sep 2014 14:03:02 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Mon, 22 Sep 2014 14:03:02 +0000 (UTC)
To: emacs-devel@gnu.org
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Mon Sep 22 16:02:54 2014
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by plane.gmane.org with esmtp (Exim 4.69)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1XW4Cb-0004gI-13
	for ged-emacs-devel@m.gmane.org; Mon, 22 Sep 2014 16:02:53 +0200
Original-Received: from localhost ([::1]:46802 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1XW4Ca-0003yz-Lh
	for ged-emacs-devel@m.gmane.org; Mon, 22 Sep 2014 10:02:52 -0400
Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:37268)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <stephen_leake@stephe-leake.org>) id 1XW4CR-0003vP-Ow
	for emacs-devel@gnu.org; Mon, 22 Sep 2014 10:02:50 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <stephen_leake@stephe-leake.org>) id 1XW4CL-0007jW-9E
	for emacs-devel@gnu.org; Mon, 22 Sep 2014 10:02:43 -0400
Original-Received: from dnvrco-outbound-snat.email.rr.com ([107.14.73.226]:58635
	helo=dnvrco-oedge-vip.email.rr.com)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <stephen_leake@stephe-leake.org>) id 1XW4CL-0007j2-48
	for emacs-devel@gnu.org; Mon, 22 Sep 2014 10:02:37 -0400
Original-Received: from [70.94.38.149] ([70.94.38.149:50499] helo=TAKVER)
	by dnvrco-oedge01 (envelope-from <stephen_leake@stephe-leake.org>)
	(ecelerity 3.5.0.35861 r(Momo-dev:tip)) with ESMTP
	id 16/28-18326-6FB20245; Mon, 22 Sep 2014 14:02:30 +0000
In-Reply-To: <541F4AD0.3060109@dancol.org> (Daniel Colascione's message of
	"Sun, 21 Sep 2014 15:01:52 -0700")
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3 (windows-nt)
X-RR-Connecting-IP: 107.14.64.118:25
X-Authority-Analysis: v=2.1 cv=btPCBSqi c=1 sm=1 tr=0
	a=AppmJ/7ZOOFWL/q6u6u93g==:117 a=AppmJ/7ZOOFWL/q6u6u93g==:17
	a=ayC55rCoAAAA:8 a=9XSUBuVRJI8A:10 a=o_R75loqY_IA:10
	a=9i_RQKNPAAAA:8 a=KxCsr730AAAA:8 a=rPc5LbpOMhw_17ECiEAA:9
	a=lEV88E3HW2IA:10
X-Cloudmark-Score: 0
X-detected-operating-system: by eggs.gnu.org: BaiduSpider 
X-Received-From: 107.14.73.226
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:174645
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/174645>

Daniel Colascione <dancol@dancol.org> writes:

> Wagner's thesis contains a description of a few alternative incremental
> GLR algorithms that look very promising.

Thanks for the pointer; I really need better error handling in the Ada
mode parser. It currently defaults to a _really_ dumb indentation
algorithm (same as previous line) when the parse fails.

> We can start parsing at some region we quickly determine is semantically
> equivalent to the start of the buffer using something like
> font-lock-beginning-of-syntax-function.  

For Ada, that's only start of buffer; the standard convention is only
one compilation-unit (the single grammar start symbol) per file.

> Here, we can reliably partially parse the buffer by parsing
> forward (remember, GLR parsing is an online algorithm) until the
> parsing limit is after the visible portion of the buffer; we can then
> use a "shortest-to-completion" production (which we can pre-compute
> from the grammar) to insert virtual tokens that complete the language.

I had the same idea; I was hoping it was new and I could make money :).
Is that in Wagner's thesis (I didn't see it on a quick scan), or written
up somewhere else?

> One really nice
> property of GLR parsing (and LR parsers generally) is that we can
> trivially pause and restart them, so we can parse in the background with
> very little impact on user activity, using `while-no-input' to back out
> to the event loop when we detect we have something more important to
> do.

That's what I need! Although a really fast GLR parser implemented in
compiled Ada that accesses the buffer via an Emacs FFI would be better.

> Error recovery is a challenge --- we want to do something useful on
> buffers that are temporarily invalided by user edits.  One of the
> biggest challenges is dealing with mismatched brace constructs, since
> one missing "}" in C-mode can cause the entire rest of the buffer to no
> longer conform to the grammar.

I'm hoping to use the same "shortest-to-completion" algorithm for this.
It may distort the rest of the buffer, but it will be better than what
it does now.

> Bridge parsing is particularly interesting here: the basic idea is that
> we perform one pass over the buffer using a simplified grammar that
> looks only at nesting constructs, insert virtual braces to make
> everything balance, and then apply the usual parsing algorithm (with
> ad-hoc error recovery) to the result. I haven't tried it, but it
> apparently works pretty well.

That could work for Ada as well.

> I like what SMIE does, but I don't think it's powerful enough to parse
> C++. 

I wrote a SMIE parser for Ada, but it ended up always parsing the whole
buffer. So I switched to LALR. Then to generalized so I would not have
to mangle the grammar in the Ada standard as much.

> GLR parsers also leave us with a parse forest instead of a parse tree,
> and we really need a parse *tree* if we're going to apply fontification
> rules specified in terms of AST nodes.  We need to prune the parse
> forest down to a parse tree before using it for fontification.

In Ada mode, I just assume the results of parallel parses that end in
the same state are equivalent for the purposes of Ada mode
(fontification and navigation). So I just throw away all but one result.
I haven't really investigated whether this is true.

> Consider the most vexing parse:
>
>   type variable(function())

One reason I like Ada; the equivalent declarations are not ambiguous.

-- 
-- Stephe