On 09/22/2014 03:21 AM, Vladimir Kazanov wrote:
> On Mon, Sep 22, 2014 at 1:01 AM, Daniel Colascione wrote:
>
>> I've been working (very, very, very slowly) on similar functionality.
>> The basic idea is based on the incremental lexing algorithm that Tim
>> A. Wagner sets out in chapter 5 of his thesis [1]. The key is
>> dynamically tracking the lookahead used while we generate each token.
>> Wagner's algorithm allows us to incorporate arbitrary lookahead into
>> the invalidation state, so supporting something like flex's unlimited
>> trailing context is no problem.
>>
>> The nice thing about this algorithm is that, like the parser, it's an
>> online algorithm and arbitrarily restartable.
>
> I have already mentioned Wagner's paper in my previous letters.
> Actually, it is the main source of inspiration :-) But I think it is a
> bit over-complicated, and the only implementation I saw (NetBeans'
> Lexer API) does not even try to implement it completely. Which is
> okay; academic papers tend to idealize things.

That Lexer is a dumb state matcher, last time I checked. So is
Eclipse's. Neither is adequate, at least not if you want to support
lexing *arbitrary* languages (e.g., Python and JavaScript) with
guaranteed correctness in the face of arbitrary buffer modification.

> You do realize that this is a client code problem? We can only
> recommend using this or that regex engine, or even set the lookahead
> value for various token types by hand; the latter would probably work
> for most real-life cases.
>
> I am not even sure that it is possible to do it Wagner's way (have a
> real next_char() function) in Emacs. I would check the Lexer API
> solution as a starting point.

Of course it's possible to implement in Emacs. Buffers are strictly
more powerful than character streams.

>> Where my thing departs from flex is that I want to use a regular
>> expression (in the rx sense) to describe the higher-level parsing
>> automaton instead of making mode authors fiddle with start states.
>> This way, it's easy to incorporate support for things like
>> JavaScript's regular expression syntax, in which "/" can mean one of
>> two tokens depending on the previous token.
>>
>> (Another way of dealing with lexical ambiguity is to let the lexer
>> return an arbitrary number of tokens for a given position and let the
>> GLR parser sort it out, but I'm not as happy with that solution.)
>
> I do not want to solve any concrete lexing problems. The whole point
> is to supply a way to do it incrementally. I do not want to know
> anything about the code above or below, be it GLR/LR/flex/etc.
>
>> There are two stages here: in *some* cases you want fontification to
>> use the results of tokenization directly; in other cases, you want to
>> apply fontification rules to the result of parsing that token stream.
>> Splitting the fontification rules between terminals and non-terminals
>> this way helps us maintain rudimentary fontification even for invalid
>> buffer contents --- that is, if the user types gibberish in a C-mode
>> buffer, we want constructs that look like keywords and strings in
>> that gibberish stream to be highlighted.
>
> Yes, and it is the client code that has to decide those things, be it
> using only the token list to do fontification or letting a
> higher-level parser do it.

Unless the parser itself is incremental, you're going to have
interactivity problems.
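
To make the lookahead bookkeeping concrete, here's a rough, untested
Elisp sketch of what I mean (all names are invented for illustration;
a real implementation would keep tokens in an interval structure and
drive invalidation from before/after-change-functions rather than
scanning a list):

(require 'cl-lib)

;; Each token remembers not just its extent but how far the lexer
;; peeked while recognizing it.  An edit at position P must invalidate
;; every token whose examined range [start, lookahead-end) covers P.
(cl-defstruct my-token start end lookahead-end type)

(defvar-local my-tokens nil
  "Token list for this buffer, in buffer order.")
(defvar-local my-high-water-mark 1
  "Rightmost position examined while lexing the current token.")

(defun my-next-char (pos)
  "Return the char at POS, recording POS as consumed lookahead."
  (setq my-high-water-mark (max my-high-water-mark (1+ pos)))
  (char-after pos))

(defun my-lex-token (start)
  "Lex one token at START with a toy rule set: runs of non-blanks.
Return a `my-token', or nil at end of buffer."
  (setq my-high-water-mark start)
  (let ((c (my-next-char start)))
    (cond
     ((null c) nil)
     ((memq c '(?\s ?\t ?\n))
      (make-my-token :start start :end (1+ start)
                     :lookahead-end my-high-water-mark :type 'blank))
     (t
      ;; Deciding where the word ends requires peeking one char past
      ;; it, and that peek lands in the invalidation interval even
      ;; though it is not part of the token.
      (let ((end (1+ start)) c2)
        (while (and (setq c2 (my-next-char end))
                    (not (memq c2 '(?\s ?\t ?\n))))
          (setq end (1+ end)))
        (make-my-token :start start :end end
                       :lookahead-end my-high-water-mark :type 'word))))))

(defun my-damaged-tokens (pos)
  "Tokens to relex after an edit at POS (boundary cases hand-waved)."
  (cl-remove-if-not
   (lambda (tok)
     (and (<= (my-token-start tok) pos)
          (< pos (my-token-lookahead-end tok))))
   my-tokens))

The whole trick is that my-next-char records how far the lexer
actually looked, so an edit invalidates exactly the tokens whose
examined ranges it touches - that is all the "invalidation state"
amounts to.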
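And to illustrate the "/" point: the context-sensitivity is tiny once
the lexer keeps a handle on the last significant token. This is a toy
stand-in, not the rx-level automaton itself; `prev' is assumed to be
that handle:

(defun my-js-token-after-slash (prev)
  "Decide what a \"/\" begins, given PREV, the last significant token type.
Toy version of the context-sensitivity an rx-described automaton
would encode declaratively."
  (if (memq prev '(identifier number string rparen rbracket))
      'punct-division     ; e.g. a / b
    'regex-literal))      ; e.g. return /ab+c/;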
>>> I will definitely check it out, especially because it uses GLR (it
>>> really does?!), which can be non-trivial to implement.
>>
>> Wagner's thesis contains a description of a few alternative
>> incremental GLR algorithms that look very promising.
>
> Yes, and a lot more :-) I want to concentrate on a smaller problem - I
> don't feel like implementing the whole thesis right now.
>
>> I have a few extensions in mind too. It's important to be able to
>> quickly fontify a particular region of the buffer --- e.g., while
>> scrolling.
>>
>> If we've already built a parse tree and damage part of the buffer, we
>> can repair the tree and re-fontify fairly quickly. But what if we
>> haven't parsed the whole buffer yet?
>
> Nice. And I will definitely need to discuss all the optimization
> possibilities later. First, the core logic has to be implemented.
>
> Bottom line: I want to take this particular narrow problem and a few
> user code examples (for me it is a port of CPython's LL(1) parser)
> and see if I can solve it in an optimal way. A working prototype will
> take some time, a month or more - I am not in a hurry.
>
> As much as I understand it, you want to cooperate on this, right..?

*sigh* It sounds like you want to create something simple. You'll run
into the same problems I did, or you'll produce something less than
fully general. I don't have enough time to work on something that
isn't fully general, and I'm sick of writing language-specific text
parsing code.

Have fun.
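
P.S. On the "what if we haven't parsed the whole buffer yet" question:
one cheap answer is to let redisplay drive the lexer lazily, extending
the token stream only as far as the window asks. Untested sketch;
`my-lex-up-to', `my-tokens-in', and `my-face-for-token' are stand-ins
for whatever entry points the real design ends up with:

(defvar-local my-lexed-up-to 1
  "Position up to which the token stream is known to be valid.")

(defun my-jit-fontify (beg end)
  "Fontify BEG..END from the token stream, lexing on demand."
  (when (> end my-lexed-up-to)
    (my-lex-up-to end)                   ; hypothetical: extend tokens to END
    (setq my-lexed-up-to end))
  (dolist (tok (my-tokens-in beg end))   ; hypothetical accessor
    (put-text-property (my-token-start tok) (my-token-end tok)
                       'face (my-face-for-token tok))))

;; In the mode function:
;; (jit-lock-register #'my-jit-fontify)

An edit then just pulls `my-lexed-up-to' back to the damage point;
everything beyond it gets relexed the next time it scrolls into view.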