From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Daniel Colascione <dancol@dancol.org>
Newsgroups: gmane.emacs.devel
Subject: Re: Tokenizing
Date: Mon, 22 Sep 2014 06:55:00 -0700
Message-ID: <54202A34.6050206@dancol.org>
References: <CAAs=0-1-1KCMOvWm7+kt8Ddiu1BZznoUpB2SVFAM_fV3zNVE8A@mail.gmail.com>
	<E1XV2XA-0002kq-7t@fencepost.gnu.org>
	<CAAs=0-0Jr7kiJuqmG_sxrO6VAMf7NEoJ-FAed-ZNE0H5srBJ8A@mail.gmail.com>
	<E1XVKb1-0006SZ-Un@fencepost.gnu.org>
	<CAAs=0-05L9nV69HPWTfEgtEq_x8LiGivmCo7vOJEr5dDJMVCsg@mail.gmail.com>
	<85ha01dm5u.fsf@stephe-leake.org>
	<CAAs=0-0jNfgkBVoQaV+GWBRCu=-yLteJU83+4EbVin4a5fqTGg@mail.gmail.com>
	<541F4AD0.3060109@dancol.org>
	<CAAs=0-3grY7DYkQDr9vbccp2iTUzkJWnFwSmOmu0nhSVB4ZzxQ@mail.gmail.com>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha256;
	protocol="application/pgp-signature";
	boundary="DrcRsb76bhRag4wwn0xbeQiFg0nBThrTf"
X-Trace: ger.gmane.org 1411395305 13871 80.91.229.3 (22 Sep 2014 14:15:05 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Mon, 22 Sep 2014 14:15:05 +0000 (UTC)
Cc: Stephen Leake <stephen_leake@stephe-leake.org>, emacs-devel@gnu.org
To: Vladimir Kazanov <vekazanov@gmail.com>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Mon Sep 22 16:14:59 2014
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by plane.gmane.org with esmtp (Exim 4.69)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1XW4OI-0002hE-99
	for ged-emacs-devel@m.gmane.org; Mon, 22 Sep 2014 16:14:58 +0200
Original-Received: from localhost ([::1]:46909 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1XW4OH-0001Bw-Pb
	for ged-emacs-devel@m.gmane.org; Mon, 22 Sep 2014 10:14:57 -0400
Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:41587)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <dancol@dancol.org>) id 1XW4O9-00019x-4g
	for emacs-devel@gnu.org; Mon, 22 Sep 2014 10:14:55 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <dancol@dancol.org>) id 1XW4O2-00036Y-Ok
	for emacs-devel@gnu.org; Mon, 22 Sep 2014 10:14:49 -0400
Original-Received: from dancol.org ([2600:3c01::f03c:91ff:fedf:adf3]:58080)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <dancol@dancol.org>) id 1XW4O2-00035g-HB
	for emacs-devel@gnu.org; Mon, 22 Sep 2014 10:14:42 -0400
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=dancol.org;
	s=x; 
	h=Content-Type:In-Reply-To:References:Subject:CC:To:MIME-Version:From:Date:Message-ID;
	bh=tTaeiEMMJMQkoV4aTMMz3d68t+Ds/jnNl2b2RRFeuFA=; 
	b=NsqcHXQz7LJ7qcFBdhKObtMt1wVDs0sqHGeAx+JRCeKdXa7geh9KFslsea358CdjInKMZMKl6IGOYs4zG1Xl7xArFBsM+kHNNrOBQWrO9NeTPp5BbDiCNgNXMzK34TkwMJHS4PANM12Fb+2Ev7dlprjbn7G8Uw0vOjxtxsj1AzSmOIdT5OPzq21qVDYF/KfaoB9ApkYmTc5nEi2K9LEnqL3XCWEGepngbryanoJ6tsBSXEdDmpdsOt0iwTKhh8mGswb+oag48nQvnHEV9wv7wcUV5sJrVFhK+cPp1Umg9LcFbHWXONcN7Owaxqr8snGHaIzVuHJ8SNWjheTRgX/Zfg==;
Original-Received: from [166.170.49.116] (helo=[192.168.137.148])
	by dancol.org with esmtpsa (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128)
	(Exim 4.84_RC2) (envelope-from <dancol@dancol.org>)
	id 1XW455-0008Ae-Dw; Mon, 22 Sep 2014 06:55:07 -0700
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
	rv:31.0) Gecko/20100101 Thunderbird/31.1.1
In-Reply-To: <CAAs=0-3grY7DYkQDr9vbccp2iTUzkJWnFwSmOmu0nhSVB4ZzxQ@mail.gmail.com>
X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address
	(bad octet value).
X-Received-From: 2600:3c01::f03c:91ff:fedf:adf3
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:174647
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/174647>

This is an OpenPGP/MIME signed message (RFC 4880 and 3156)
--DrcRsb76bhRag4wwn0xbeQiFg0nBThrTf
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

On 09/22/2014 03:21 AM, Vladimir Kazanov wrote:
> On Mon, Sep 22, 2014 at 1:01 AM, Daniel Colascione <dancol@dancol.org> =
wrote:
>=20
>> I've been working (very, very, very slowly) on similar functionality.
>> The basic idea is based on the incremental lexing algorithm that Tim A=
=2E
>> Wagner sets out in chapter 5 of his thesis [1].  The key is dynamicall=
y
>> tracking lookahead used while we generate each token.  Wagner's
>> algorithm allows us to incorporate arbitrary lookahead into the
>> invalidation state, so supporting something like flex's unlimited
>> trailing context is no problem.
>>
>> The nice thing about this algorithm is that like the parser, it's an
>> online algorithm and arbitrarily restartable.
>=20
> I have already mentioned Wagner's paper in the previous letters.
> Actually, it is the main source of inspiration :-) But I think it is a
> bit over-complicated, and the only implementation I saw (Netbean's
> Lexer API) does not even try to implement it completely. Which is
> okay, academic papers tend to idealize things.

That Lexer is a dumb state matcher, last time I checked. So is
Eclipse's. Neither is adequate, at least not if you want to support
lexing *arbitrary* languages (e.g., Python and JavaScript) with
guaranteed correctness in the face of arbitrary buffer modification.

> You do realize that this is a client's code problem? We can only
> recommend to use this or that regex engine, or even set the lookahead
> value for various token types by hand; and the latter case would
> probably work for most real-life cases.
>=20
> I am not even sure that it is possible to do it the Wagner's way (have
> a real next_char() function) in Emacs. I would check Lexer API
> solution as a starting point.

Of course it's possible to implement in Emacs. Buffers are strictly more
powerful than character streams.

>=20
>> Where my thing departs from flex is that I want to use a regular
>> expression (in the rx sense) to describe the higher-level parsing
>> automaton instead of making mode authors fiddle with start states.  Th=
is
>> way, it's easy to incorporate support for things like JavaScript's
>> regular expression syntax, in which "/" can mean one of two tokens
>> depending on the previous token.
>>
>> (Another way of dealing with lexical ambiguity is to let the lexer
>> return an arbitrary number of tokens for a given position and let the
>> GLR parser sort it out, but I'm not as happy with that solution.)
>=20
> I do not want to solve any concrete lexing problems. The whole point
> is about supplying a way to do it incrementally. I do not want to know
> anything about the code above or below , be it GLR/LR/flex/etc.
>=20
>>
>> There are two stages here: you want in *some* cases for fontification =
to
>> use the results of tokenization directly; in other cases, you want to
>> apply fontification rules to the result of parsing that token stream.
>> Splitting the fontification rules between terminals and non-terminals
>> this way helps us maintain rudimentary fontification even for invalid
>> buffer contents --- that is, if the user types gibberish in a C-mode
>> buffer, we want constructs that look like keywords and strings in that=

>> gibberish stream to be highlighted.
>=20
> Yes, and this is a client's code that has to decide those things, be
> it using only the token list to do fontification or let a higher-level
> a parser do it.

Unless the parser itself is incremental, you're going to have
interactivity problems.

>>> I will definitely check it out, especially because it uses GLR(it
>>> really does?!), which can non-trivial to implement.
>>
>> Wagner's thesis contains a description of a few alternative incrementa=
l
>> GLR algorithms that look very promising.
>=20
> Yes, and a lot more :-) I want to concentrate on a smaller problem -
> don't feel like implementing the whole thesis right now.
>=20
>> I have a few extensions in mind too.  It's important to be able to
>> quickly fontify a particular region of the buffer --- e.g., while scro=
lling.
>>
>> If we've already built a parse tree and damage part of the buffer, we
>> can repair the tree and re-fontify fairly quickly. But what if we
>> haven't parsed the whole buffer yet?
>>
>=20
> Nice. And I will definitely need to discuss all the optimization
> possibilities later. First, the core logic has to be implemented.
>=20
> Bottom line: I want to take this particular narrow problem, a few user
> code examples (for me it is a port of CPython's LL(1) parser) and see
> if I can solve in an optimal way. A working prototype will take some
> time, a month or more - I am not in a hurry.
>=20
> As much as I understand, you want to cooperate on it, right..?

*sigh* It sounds like you want to create something simple. You'll run
into the same problems I did, or you'll produce something less than
fully general. I don't have enough time to work on something that isn't
fully general. I'm sick of writing language-specific text parsing code.

Have fun.


--DrcRsb76bhRag4wwn0xbeQiFg0nBThrTf
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQIcBAEBCAAGBQJUICo0AAoJEN4WImmbpWBl3AoP/jDxLfeOJAwuo1kc9qON43ST
+f7B4kjFcG7KYun5QyGl4mqnx4asZXwty8D5ksyLcMOhfanAORYG0pda1E2R2W6u
X3yBRQ5dhjD5noABpnLZuRFeSij55nkXUJ211l5QDJh0E0bRQmhnTa7ldzxb9nQ5
wSFnUwIuPwYfm7cvPKQfZNgmrjBThSjMJUJ0CNdjQwCJO0v1zMRi3ZLKukIDKH11
qdGXRTvdU1Y0eUGUV7XthHOQmOgUEEq8LwOLwiPhTugOlIayQbevyrf9i2b4lnM3
e7RBD3ioVpPztyG8IkmV8XxLrIJfR1wB5UTQHfLwYaqPYuiO3RXYvoZ+Kjkhu+vz
PkYnxu06+WLyuUm7klCmjcnifezywNyyInqwlgS7ee7epMCSh3W9coYiyRb59KU/
aVFDNC6tl1Sc8BcrO6OBC68yojqvZnVN5zt10z6tcgZhRZBMwTlH4GZ6R5kWidhS
q0CBhiD2dnsDvIU/WAN9uZewgYfDpufumEI5uehnPwFGdNf8Xt4/tz583vr64Cjo
rHcKLvX5TPRPDFyyc682Ic5uB/wzZ5MPMuG8RplQNKsV2g8F3sNVNpUTGjmEASGG
FiJZp2yOIxkJu07uZ7sIDTBNGy49zXI1Cu2rf7lUEdtxoh18viIx/Xlcv/vH83Pn
rGfHb1aDX96+nMyzjz6u
=1c33
-----END PGP SIGNATURE-----

--DrcRsb76bhRag4wwn0xbeQiFg0nBThrTf--