From: Vladimir Kazanov
Newsgroups: gmane.emacs.devel
Subject: Re: Tokenizing
Date: Mon, 22 Sep 2014 13:21:47 +0300
To: Daniel Colascione
Cc: Stephen Leake, emacs-devel@gnu.org

On Mon, Sep 22, 2014 at 1:01 AM, Daniel Colascione wrote:
> I've been working (very, very, very slowly) on similar functionality.
> The basic idea is based on the incremental lexing algorithm that Tim A.
> Wagner sets out in chapter 5 of his thesis [1]. The key is dynamically
> tracking lookahead used while we generate each token. Wagner's
> algorithm allows us to incorporate arbitrary lookahead into the
> invalidation state, so supporting something like flex's unlimited
> trailing context is no problem.
>
> The nice thing about this algorithm is that, like the parser, it's an
> online algorithm and arbitrarily restartable.

I have already mentioned Wagner's paper in my previous messages.
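Just to make sure we mean the same thing by "tracking lookahead", here is how I picture the invalidation rule, as a toy Python sketch. The names are mine and nothing here is Emacs- or Wagner-specific code: each token records how many characters past its end the lexer examined, and an edit invalidates exactly the tokens whose examined region overlaps the edited span.

```python
from dataclasses import dataclass

@dataclass
class Token:
    start: int      # offset of the token's first character
    end: int        # offset one past its last character
    lookahead: int  # characters examined beyond `end` while lexing
    kind: str

def tokens_to_relex(tokens, edit_start, edit_end):
    """Return tokens whose examined region [start, end + lookahead)
    overlaps the edited span [edit_start, edit_end)."""
    return [t for t in tokens
            if t.start < edit_end and edit_start < t.end + t.lookahead]

# Two tokens, where the first needed one character of lookahead to
# decide where it ended. An edit at offset 2 touches the first token
# only through its recorded lookahead, so both must be relexed:
toks = [Token(0, 2, 1, "word"), Token(2, 4, 0, "word")]
assert [t.start for t in tokens_to_relex(toks, 2, 3)] == [0, 2]
```

The point is that lookahead becomes part of the invalidation state, so a token can be damaged by an edit that never touches its own span.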
Actually, it is the main source of inspiration :-) But I think it is a
bit over-complicated, and the only implementation I have seen (the
NetBeans Lexer API) does not even try to implement it completely. Which
is okay: academic papers tend to idealize things.

> We can't use the built-in Emacs regular expression engine for this
> facility because there's no way to tell regex.c to record how much
> lookahead it used, and even if there were, I'd rather not rely on its
> backtracking matcher. I've written a DFA-based regex matcher that
> should help in implementing this algorithm; of course, a DFA matcher
> has a lookahead of zero.
>
> Mine supports zero-width predicates though, so we can actually achieve
> nonzero lookahead (and lookbehind!) if needed.

You do realize that this is a problem for client code? We can only
recommend one regex engine or another, or even let clients set the
lookahead value for each token type by hand; the latter would probably
work for most real-life cases. I am not even sure it is possible to do
it Wagner's way (with a real next_char() function) in Emacs. I would
look at the Lexer API's solution as a starting point.

> Where my thing departs from flex is that I want to use a regular
> expression (in the rx sense) to describe the higher-level parsing
> automaton instead of making mode authors fiddle with start states.
> This way, it's easy to incorporate support for things like
> JavaScript's regular expression syntax, in which "/" can mean one of
> two tokens depending on the previous token.
>
> (Another way of dealing with lexical ambiguity is to let the lexer
> return an arbitrary number of tokens for a given position and let the
> GLR parser sort it out, but I'm not as happy with that solution.)

I do not want to solve any concrete lexing problems. The whole point is
to supply a way to do the lexing incrementally. I do not want to know
anything about the code above or below, be it GLR/LR/flex/etc.
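For what it is worth, the "/" case is a good illustration of why I want to leave such decisions to the client: the client's lexer can disambiguate it from the kind of the previous significant token, with no help from the incremental machinery. A toy sketch (the token kind names are made up, not from any real lexer):

```python
# After an identifier, a number, or a closing ")" / "]", a "/" in
# JavaScript is the division operator; in other contexts it begins a
# regular expression literal.
DIVISION_PREDECESSORS = {"identifier", "number", "rparen", "rbracket"}

def classify_slash(prev_token_kind):
    """Decide which token a "/" starts, given the previous token kind."""
    if prev_token_kind in DIVISION_PREDECESSORS:
        return "div_operator"
    return "regex_start"

assert classify_slash("identifier") == "div_operator"  # a / b
assert classify_slash("equals") == "regex_start"       # x = /re/
```

This kind of rule lives entirely in client code; the incremental layer only needs to know how far the lexer looked back or ahead to make the decision.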
> There are two stages here: you want in *some* cases for fontification
> to use the results of tokenization directly; in other cases, you want
> to apply fontification rules to the result of parsing that token
> stream. Splitting the fontification rules between terminals and
> non-terminals this way helps us maintain rudimentary fontification
> even for invalid buffer contents --- that is, if the user types
> gibberish in a C-mode buffer, we want constructs that look like
> keywords and strings in that gibberish stream to be highlighted.

Yes, and it is client code that has to decide those things, whether
that means fontifying from the token list alone or letting a
higher-level parser do it.

>> I will definitely check it out, especially because it uses GLR (it
>> really does?!), which can be non-trivial to implement.
>
> Wagner's thesis contains a description of a few alternative
> incremental GLR algorithms that look very promising.

Yes, and a lot more :-) I want to concentrate on a smaller problem - I
don't feel like implementing the whole thesis right now.

> I have a few extensions in mind too. It's important to be able to
> quickly fontify a particular region of the buffer --- e.g., while
> scrolling.
>
> If we've already built a parse tree and damage part of the buffer, we
> can repair the tree and re-fontify fairly quickly. But what if we
> haven't parsed the whole buffer yet?

Nice. I will definitely want to discuss all the optimization
possibilities later; first, the core logic has to be implemented.

Bottom line: I want to take this particular narrow problem, plus a few
examples of user code (for me it is a port of CPython's LL(1) parser),
and see if I can solve it in an optimal way. A working prototype will
take some time, a month or more - I am not in a hurry.

As far as I understand, you would like to cooperate on this, right?
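P.S. On the two-stage fontification idea: this is how I picture the token-level fallback, as a toy Python sketch. The face names are only for illustration, and the dictionaries stand in for whatever the real token list and parse results would be:

```python
# Token kinds always map to a face, so even when the parser rejects
# the buffer, keywords and strings stay highlighted; parser-level
# results, where available, override the token-level guess.
TOKEN_FACES = {"keyword": "font-lock-keyword-face",
               "string": "font-lock-string-face"}

def fontify(tokens, parse_overrides=None):
    """Token-level faces first; parser-level overrides win if present."""
    faces = {i: TOKEN_FACES.get(kind) for i, kind in enumerate(tokens)}
    for i, face in (parse_overrides or {}).items():
        faces[i] = face  # the parser knows better where it succeeded
    return faces

# Gibberish still gets keyword/string faces with no parse result at all:
assert fontify(["keyword", "junk", "string"]) == {
    0: "font-lock-keyword-face", 1: None, 2: "font-lock-string-face"}
```

Again, which faces and which override policy to use is the client's business; the incremental layer only guarantees that the token list stays consistent after edits.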
-- 
Yours sincerely,

Vladimir Kazanov