From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: Daniel Colascione Newsgroups: gmane.emacs.devel Subject: Re: Tokenizing Date: Mon, 22 Sep 2014 06:55:00 -0700 Message-ID: <54202A34.6050206@dancol.org> References: <85ha01dm5u.fsf@stephe-leake.org> <541F4AD0.3060109@dancol.org> NNTP-Posting-Host: plane.gmane.org Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="DrcRsb76bhRag4wwn0xbeQiFg0nBThrTf" X-Trace: ger.gmane.org 1411395305 13871 80.91.229.3 (22 Sep 2014 14:15:05 GMT) X-Complaints-To: usenet@ger.gmane.org NNTP-Posting-Date: Mon, 22 Sep 2014 14:15:05 +0000 (UTC) Cc: Stephen Leake , emacs-devel@gnu.org To: Vladimir Kazanov Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Mon Sep 22 16:14:59 2014 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1XW4OI-0002hE-99 for ged-emacs-devel@m.gmane.org; Mon, 22 Sep 2014 16:14:58 +0200 Original-Received: from localhost ([::1]:46909 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1XW4OH-0001Bw-Pb for ged-emacs-devel@m.gmane.org; Mon, 22 Sep 2014 10:14:57 -0400 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:41587) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1XW4O9-00019x-4g for emacs-devel@gnu.org; Mon, 22 Sep 2014 10:14:55 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1XW4O2-00036Y-Ok for emacs-devel@gnu.org; Mon, 22 Sep 2014 10:14:49 -0400 Original-Received: from dancol.org ([2600:3c01::f03c:91ff:fedf:adf3]:58080) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1XW4O2-00035g-HB for emacs-devel@gnu.org; Mon, 22 Sep 2014 10:14:42 -0400 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=dancol.org; s=x; h=Content-Type:In-Reply-To:References:Subject:CC:To:MIME-Version:From:Date:Message-ID; bh=tTaeiEMMJMQkoV4aTMMz3d68t+Ds/jnNl2b2RRFeuFA=; b=NsqcHXQz7LJ7qcFBdhKObtMt1wVDs0sqHGeAx+JRCeKdXa7geh9KFslsea358CdjInKMZMKl6IGOYs4zG1Xl7xArFBsM+kHNNrOBQWrO9NeTPp5BbDiCNgNXMzK34TkwMJHS4PANM12Fb+2Ev7dlprjbn7G8Uw0vOjxtxsj1AzSmOIdT5OPzq21qVDYF/KfaoB9ApkYmTc5nEi2K9LEnqL3XCWEGepngbryanoJ6tsBSXEdDmpdsOt0iwTKhh8mGswb+oag48nQvnHEV9wv7wcUV5sJrVFhK+cPp1Umg9LcFbHWXONcN7Owaxqr8snGHaIzVuHJ8SNWjheTRgX/Zfg==; Original-Received: from [166.170.49.116] (helo=[192.168.137.148]) by dancol.org with esmtpsa (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.84_RC2) (envelope-from ) id 1XW455-0008Ae-Dw; Mon, 22 Sep 2014 06:55:07 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.1 In-Reply-To: X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address (bad octet value). X-Received-From: 2600:3c01::f03c:91ff:fedf:adf3 X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Xref: news.gmane.org gmane.emacs.devel:174647 Archived-At: This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --DrcRsb76bhRag4wwn0xbeQiFg0nBThrTf Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable On 09/22/2014 03:21 AM, Vladimir Kazanov wrote: > On Mon, Sep 22, 2014 at 1:01 AM, Daniel Colascione = wrote: >=20 >> I've been working (very, very, very slowly) on similar functionality. >> The basic idea is based on the incremental lexing algorithm that Tim A= =2E >> Wagner sets out in chapter 5 of his thesis [1]. The key is dynamicall= y >> tracking lookahead used while we generate each token. Wagner's >> algorithm allows us to incorporate arbitrary lookahead into the >> invalidation state, so supporting something like flex's unlimited >> trailing context is no problem. >> >> The nice thing about this algorithm is that like the parser, it's an >> online algorithm and arbitrarily restartable. >=20 > I have already mentioned Wagner's paper in the previous letters. > Actually, it is the main source of inspiration :-) But I think it is a > bit over-complicated, and the only implementation I saw (Netbean's > Lexer API) does not even try to implement it completely. Which is > okay, academic papers tend to idealize things. That Lexer is a dumb state matcher, last time I checked. So is Eclipse's. Neither is adequate, at least not if you want to support lexing *arbitrary* languages (e.g., Python and JavaScript) with guaranteed correctness in the face of arbitrary buffer modification. > You do realize that this is a client's code problem? We can only > recommend to use this or that regex engine, or even set the lookahead > value for various token types by hand; and the latter case would > probably work for most real-life cases. >=20 > I am not even sure that it is possible to do it the Wagner's way (have > a real next_char() function) in Emacs. I would check Lexer API > solution as a starting point. Of course it's possible to implement in Emacs. Buffers are strictly more powerful than character streams. >=20 >> Where my thing departs from flex is that I want to use a regular >> expression (in the rx sense) to describe the higher-level parsing >> automaton instead of making mode authors fiddle with start states. Th= is >> way, it's easy to incorporate support for things like JavaScript's >> regular expression syntax, in which "/" can mean one of two tokens >> depending on the previous token. >> >> (Another way of dealing with lexical ambiguity is to let the lexer >> return an arbitrary number of tokens for a given position and let the >> GLR parser sort it out, but I'm not as happy with that solution.) >=20 > I do not want to solve any concrete lexing problems. The whole point > is about supplying a way to do it incrementally. I do not want to know > anything about the code above or below , be it GLR/LR/flex/etc. >=20 >> >> There are two stages here: you want in *some* cases for fontification = to >> use the results of tokenization directly; in other cases, you want to >> apply fontification rules to the result of parsing that token stream. >> Splitting the fontification rules between terminals and non-terminals >> this way helps us maintain rudimentary fontification even for invalid >> buffer contents --- that is, if the user types gibberish in a C-mode >> buffer, we want constructs that look like keywords and strings in that= >> gibberish stream to be highlighted. >=20 > Yes, and this is a client's code that has to decide those things, be > it using only the token list to do fontification or let a higher-level > a parser do it. Unless the parser itself is incremental, you're going to have interactivity problems. >>> I will definitely check it out, especially because it uses GLR(it >>> really does?!), which can non-trivial to implement. >> >> Wagner's thesis contains a description of a few alternative incrementa= l >> GLR algorithms that look very promising. >=20 > Yes, and a lot more :-) I want to concentrate on a smaller problem - > don't feel like implementing the whole thesis right now. >=20 >> I have a few extensions in mind too. It's important to be able to >> quickly fontify a particular region of the buffer --- e.g., while scro= lling. >> >> If we've already built a parse tree and damage part of the buffer, we >> can repair the tree and re-fontify fairly quickly. But what if we >> haven't parsed the whole buffer yet? >> >=20 > Nice. And I will definitely need to discuss all the optimization > possibilities later. First, the core logic has to be implemented. >=20 > Bottom line: I want to take this particular narrow problem, a few user > code examples (for me it is a port of CPython's LL(1) parser) and see > if I can solve in an optimal way. A working prototype will take some > time, a month or more - I am not in a hurry. >=20 > As much as I understand, you want to cooperate on it, right..? *sigh* It sounds like you want to create something simple. You'll run into the same problems I did, or you'll produce something less than fully general. I don't have enough time to work on something that isn't fully general. I'm sick of writing language-specific text parsing code. Have fun. --DrcRsb76bhRag4wwn0xbeQiFg0nBThrTf Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAEBCAAGBQJUICo0AAoJEN4WImmbpWBl3AoP/jDxLfeOJAwuo1kc9qON43ST +f7B4kjFcG7KYun5QyGl4mqnx4asZXwty8D5ksyLcMOhfanAORYG0pda1E2R2W6u X3yBRQ5dhjD5noABpnLZuRFeSij55nkXUJ211l5QDJh0E0bRQmhnTa7ldzxb9nQ5 wSFnUwIuPwYfm7cvPKQfZNgmrjBThSjMJUJ0CNdjQwCJO0v1zMRi3ZLKukIDKH11 qdGXRTvdU1Y0eUGUV7XthHOQmOgUEEq8LwOLwiPhTugOlIayQbevyrf9i2b4lnM3 e7RBD3ioVpPztyG8IkmV8XxLrIJfR1wB5UTQHfLwYaqPYuiO3RXYvoZ+Kjkhu+vz PkYnxu06+WLyuUm7klCmjcnifezywNyyInqwlgS7ee7epMCSh3W9coYiyRb59KU/ aVFDNC6tl1Sc8BcrO6OBC68yojqvZnVN5zt10z6tcgZhRZBMwTlH4GZ6R5kWidhS q0CBhiD2dnsDvIU/WAN9uZewgYfDpufumEI5uehnPwFGdNf8Xt4/tz583vr64Cjo rHcKLvX5TPRPDFyyc682Ic5uB/wzZ5MPMuG8RplQNKsV2g8F3sNVNpUTGjmEASGG FiJZp2yOIxkJu07uZ7sIDTBNGy49zXI1Cu2rf7lUEdtxoh18viIx/Xlcv/vH83Pn rGfHb1aDX96+nMyzjz6u =1c33 -----END PGP SIGNATURE----- --DrcRsb76bhRag4wwn0xbeQiFg0nBThrTf--