From: Vladimir Kazanov
Newsgroups: gmane.emacs.devel
Subject: Re: Tokenizing
Date: Mon, 22 Sep 2014 13:21:47 +0300
To: Daniel Colascione
Cc: Stephen Leake, emacs-devel@gnu.org

On Mon, Sep 22, 2014 at 1:01 AM, Daniel Colascione wrote:
> I've been working (very, very, very slowly) on similar functionality.
> The basic idea is based on the incremental lexing algorithm that Tim A.
> Wagner sets out in chapter 5 of his thesis [1]. The key is dynamically
> tracking lookahead used while we generate each token. Wagner's
> algorithm allows us to incorporate arbitrary lookahead into the
> invalidation state, so supporting something like flex's unlimited
> trailing context is no problem.
>
> The nice thing about this algorithm is that, like the parser, it's an
> online algorithm and arbitrarily restartable.

I have already mentioned Wagner's paper in my previous messages.
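Just to make sure we mean the same thing by "tracking lookahead", here is how I picture the invalidation rule, as a toy Python sketch. The names are mine and nothing here is Emacs- or Wagner-specific code: each token records how many characters past its end the lexer examined, and an edit invalidates exactly the tokens whose examined region overlaps the edited span.

```python
from dataclasses import dataclass

@dataclass
class Token:
    start: int      # offset of the token's first character
    end: int        # offset one past its last character
    lookahead: int  # characters examined beyond `end` while lexing
    kind: str

def tokens_to_relex(tokens, edit_start, edit_end):
    """Return tokens whose examined region [start, end + lookahead)
    overlaps the edited span [edit_start, edit_end)."""
    return [t for t in tokens
            if t.start < edit_end and edit_start < t.end + t.lookahead]

# Two tokens, where the first needed one character of lookahead to
# decide where it ended. An edit at offset 2 touches the first token
# only through its recorded lookahead, so both must be relexed:
toks = [Token(0, 2, 1, "word"), Token(2, 4, 0, "word")]
assert [t.start for t in tokens_to_relex(toks, 2, 3)] == [0, 2]
```

The point is that lookahead becomes part of the invalidation state, so a token can be damaged by an edit that never touches its own span.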
Actually, it is the main source of inspiration :-) But I think it is a
bit over-complicated, and the only implementation I have seen (the
NetBeans Lexer API) does not even try to implement it completely. Which
is okay: academic papers tend to idealize things.

> We can't use the built-in Emacs regular expression engine for this
> facility because there's no way to tell regex.c to record how much
> lookahead it used, and even if there were, I'd rather not rely on its
> backtracking matcher. I've written a DFA-based regex matcher that
> should help in implementing this algorithm; of course, a DFA matcher
> has a lookahead of zero.
>
> Mine supports zero-width predicates though, so we can actually achieve
> nonzero lookahead (and lookbehind!) if needed.

You do realize that this is a problem for client code? We can only
recommend one regex engine or another, or even let clients set the
lookahead value for each token type by hand; the latter would probably
work for most real-life cases. I am not even sure it is possible to do
it Wagner's way (with a real next_char() function) in Emacs. I would
look at the Lexer API's solution as a starting point.

> Where my thing departs from flex is that I want to use a regular
> expression (in the rx sense) to describe the higher-level parsing
> automaton instead of making mode authors fiddle with start states.
> This way, it's easy to incorporate support for things like
> JavaScript's regular expression syntax, in which "/" can mean one of
> two tokens depending on the previous token.
>
> (Another way of dealing with lexical ambiguity is to let the lexer
> return an arbitrary number of tokens for a given position and let the
> GLR parser sort it out, but I'm not as happy with that solution.)

I do not want to solve any concrete lexing problems. The whole point is
to supply a way to do the lexing incrementally. I do not want to know
anything about the code above or below, be it GLR/LR/flex/etc.
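For what it is worth, the "/" case is a good illustration of why I want to leave such decisions to the client: the client's lexer can disambiguate it from the kind of the previous significant token, with no help from the incremental machinery. A toy sketch (the token kind names are made up, not from any real lexer):

```python
# After an identifier, a number, or a closing ")" / "]", a "/" in
# JavaScript is the division operator; in other contexts it begins a
# regular expression literal.
DIVISION_PREDECESSORS = {"identifier", "number", "rparen", "rbracket"}

def classify_slash(prev_token_kind):
    """Decide which token a "/" starts, given the previous token kind."""
    if prev_token_kind in DIVISION_PREDECESSORS:
        return "div_operator"
    return "regex_start"

assert classify_slash("identifier") == "div_operator"  # a / b
assert classify_slash("equals") == "regex_start"       # x = /re/
```

This kind of rule lives entirely in client code; the incremental layer only needs to know how far the lexer looked back or ahead to make the decision.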
> There are two stages here: you want in *some* cases for fontification
> to use the results of tokenization directly; in other cases, you want
> to apply fontification rules to the result of parsing that token
> stream. Splitting the fontification rules between terminals and
> non-terminals this way helps us maintain rudimentary fontification
> even for invalid buffer contents --- that is, if the user types
> gibberish in a C-mode buffer, we want constructs that look like
> keywords and strings in that gibberish stream to be highlighted.

Yes, and it is client code that has to decide those things, whether
that means fontifying from the token list alone or letting a
higher-level parser do it.

>> I will definitely check it out, especially because it uses GLR (it
>> really does?!), which can be non-trivial to implement.
>
> Wagner's thesis contains a description of a few alternative
> incremental GLR algorithms that look very promising.

Yes, and a lot more :-) I want to concentrate on a smaller problem - I
don't feel like implementing the whole thesis right now.

> I have a few extensions in mind too. It's important to be able to
> quickly fontify a particular region of the buffer --- e.g., while
> scrolling.
>
> If we've already built a parse tree and damage part of the buffer, we
> can repair the tree and re-fontify fairly quickly. But what if we
> haven't parsed the whole buffer yet?

Nice. I will definitely want to discuss all the optimization
possibilities later; first, the core logic has to be implemented.

Bottom line: I want to take this particular narrow problem, plus a few
examples of user code (for me it is a port of CPython's LL(1) parser),
and see if I can solve it in an optimal way. A working prototype will
take some time, a month or more - I am not in a hurry.

As far as I understand, you would like to cooperate on this, right?
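P.S. On the two-stage fontification idea: this is how I picture the token-level fallback, as a toy Python sketch. The face names are only for illustration, and the dictionaries stand in for whatever the real token list and parse results would be:

```python
# Token kinds always map to a face, so even when the parser rejects
# the buffer, keywords and strings stay highlighted; parser-level
# results, where available, override the token-level guess.
TOKEN_FACES = {"keyword": "font-lock-keyword-face",
               "string": "font-lock-string-face"}

def fontify(tokens, parse_overrides=None):
    """Token-level faces first; parser-level overrides win if present."""
    faces = {i: TOKEN_FACES.get(kind) for i, kind in enumerate(tokens)}
    for i, face in (parse_overrides or {}).items():
        faces[i] = face  # the parser knows better where it succeeded
    return faces

# Gibberish still gets keyword/string faces with no parse result at all:
assert fontify(["keyword", "junk", "string"]) == {
    0: "font-lock-keyword-face", 1: None, 2: "font-lock-string-face"}
```

Again, which faces and which override policy to use is the client's business; the incremental layer only guarantees that the token list stays consistent after edits.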
-- 
Yours sincerely,

Vladimir Kazanov