From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: Daniel Colascione Newsgroups: gmane.emacs.devel Subject: Re: Tokenizing Date: Sun, 21 Sep 2014 15:01:52 -0700 Message-ID: <541F4AD0.3060109@dancol.org> References: <85ha01dm5u.fsf@stephe-leake.org> NNTP-Posting-Host: plane.gmane.org Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="nVTQlbRT3x4DOB3lEISAOfxOf91B17hnO" X-Trace: ger.gmane.org 1411336958 17285 80.91.229.3 (21 Sep 2014 22:02:38 GMT) X-Complaints-To: usenet@ger.gmane.org NNTP-Posting-Date: Sun, 21 Sep 2014 22:02:38 +0000 (UTC) Cc: emacs-devel@gnu.org To: Vladimir Kazanov , Stephen Leake Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Mon Sep 22 00:02:34 2014 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1XVpDE-0006Nd-AI for ged-emacs-devel@m.gmane.org; Mon, 22 Sep 2014 00:02:32 +0200 Original-Received: from localhost ([::1]:41104 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1XVpDE-0002Jn-0C for ged-emacs-devel@m.gmane.org; Sun, 21 Sep 2014 18:02:32 -0400 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:58323) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1XVpCv-0002JW-IX for emacs-devel@gnu.org; Sun, 21 Sep 2014 18:02:17 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1XVpCr-00027j-52 for emacs-devel@gnu.org; Sun, 21 Sep 2014 18:02:13 -0400 Original-Received: from dancol.org ([2600:3c01::f03c:91ff:fedf:adf3]:54148) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1XVpCq-00026t-Or for emacs-devel@gnu.org; Sun, 21 Sep 2014 18:02:09 -0400 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=dancol.org; s=x; h=Content-Type:In-Reply-To:References:Subject:CC:To:MIME-Version:From:Date:Message-ID; bh=hr+zIYmIBdEvPXGmr5v3T54xX3QBYEMqdErEdwWNdac=; b=FaLYCMlqaidO0Fg+1dP6VzwNGSSIBsyt54mdQBOof43KINzHjP/UziGnhp0n30SXSGHM08kslBVjL5OWjOln9bgRS/JzKkwAndFLnyeLWHtvXNnLzLzeRdGjgmkDi9IF6+sfEZhljZ1i7k8/C6bm6W/d0Lms2TY6y4ZhbOdL82K7F6UabOWGm2R+ygF6kuZEXzTFkyJHE9k0XBHhHQ58IsMDeu1oxx/FxFy/hRC2ccMC+HU7GD09oACS7DVv5GV6jccGr+yFzW3jJc0yvqGZ7piwIQ5XsAcj50gGgsF1MI7DjgGZdFx2vQl2/UT8KcPS8ykmiu03r5iK897nXa1usg==; Original-Received: from [24.16.210.95] (helo=[192.168.1.50]) by dancol.org with esmtpsa (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.84_RC2) (envelope-from ) id 1XVpCf-0003k1-0S; Sun, 21 Sep 2014 15:01:57 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.1 In-Reply-To: X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address (bad octet value). X-Received-From: 2600:3c01::f03c:91ff:fedf:adf3 X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Xref: news.gmane.org gmane.emacs.devel:174629 Archived-At: This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --nVTQlbRT3x4DOB3lEISAOfxOf91B17hnO Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable On 09/21/2014 11:55 AM, Vladimir Kazanov wrote: >> I don't normally edit 7000 line files, so the Ada mode parsing delay i= s >> not noticeable to me, so I prefer the current Ada mode approach of not= >> using the idle timer to trigger a parse. But it could be a user option= =2E >> >=20 > I will look into that. Although the main idea is to *keep the token > list consistent* most of the time. There will definitely be > customization possibilities. I've been working (very, very, very slowly) on similar functionality. The basic idea is based on the incremental lexing algorithm that Tim A. Wagner sets out in chapter 5 of his thesis [1]. The key is dynamically tracking lookahead used while we generate each token. Wagner's algorithm allows us to incorporate arbitrary lookahead into the invalidation state, so supporting something like flex's unlimited trailing context is no problem. The nice thing about this algorithm is that like the parser, it's an online algorithm and arbitrarily restartable. We can't use the built-in Emacs regular expression engine for this facility because there's no way to tell regex.c to record how much lookahead it used, and even if there were, I'd rather not rely on its backtracking matcher. I've written a DFA-based regex matcher that should help in implementing this algorithm; of course, a DFA matcher has a lookahead of zero. Mine supports zero-width predicates though, so we can actually use achieve nonzero lookahead (and lookbehind!) if needed. Where my thing departs from flex is that I want to use a regular expression (in the rx sense) to describe the higher-level parsing automaton instead of making mode authors fiddle with start states. This way, it's easy to incorporate support for things like JavaScript's regular expression syntax, in which "/" can mean one of two tokens depending on the previous token. (Another way of dealing with lexical ambiguity is to let the lexer return an arbitrary number of tokens for a given position and let the GLR parser sort it out, but I'm not as happy with that solution.) >> Ada mode uses text properties to store parse results; the tokenizer >> results are part of that, but are not stored separately. I don't see >> much point in separating the tokenizer from the parser; the tokenizer >> results are not useful by themselves (at least, not in Ada mode). >> >=20 > First, this not quite right. Tokenization results can be used, for > example, for granular syntax highlighting. Font Lock basically just > uses regexps to catch something that looks like > comments/keywords/whatever. Tokenizer already *knows* for sure what it > found. And you don't have to build a full parser to use the results. There are two stages here: you want in *some* cases for fontification to use the results of tokenization directly; in other cases, you want to apply fontification rules to the result of parsing that token stream. Splitting the fontification rules between terminals and non-terminals this way helps us maintain rudimentary fontification even for invalid buffer contents --- that is, if the user types gibberish in a C-mode buffer, we want constructs that look like keywords and strings in that gibberish stream to be highlighted. > Second, it not a tokenizer I want to build, there is a > misunderstanding of sorts. It is a helper mode (similar to Font Lock, > in a way) for keeping token lists up to date all the time, easy and > fast. User code - the tokenizer itself - will just have to provide an > interface to the mode (be restartable and supply required restart > information in resulting tokens). The mode will use the information to > avoid extra tokenizing. Indeed. I don't imagine replacing font-lock per se, but instead just implementing a font-lock matcher that *lazily* brings the token list up-to-date and pulls keywords out of it for fontification. It's important to keep font-lock around for backward compatibility, but there's no reason that font-lock's data source has to be a bunch of simple regex matches. cc-mode, for example, uses font-lock to apply highlights that come from a parser that looks oddly like a CYK parser. (We start parsing at all points in the buffer that *might* start a declaration; there's also some limited, ad-hoc backtracking.) >> I have not noticed any problems with the text properties interface; in= >> particular, storing and retrieving text properties is fast compared to= >> parsing. Ada mode stores about two parse result text properties per >> source line on average. >=20 > I did not know about your mode - and parsers are sort of my hobby :-) Mine too. :-) > I will definitely check it out, especially because it uses GLR(it > really does?!), which can non-trivial to implement. Wagner's thesis contains a description of a few alternative incremental GLR algorithms that look very promising. I have a few extensions in mind too. It's important to be able to quickly fontify a particular region of the buffer --- e.g., while scrolli= ng. If we've already built a parse tree and damage part of the buffer, we can repair the tree and re-fontify fairly quickly. But what if we haven't parsed the whole buffer yet? We can start parsing at some region we quickly determine is semantically equivalent to the start of the buffer using something like font-lock-beginning-of-syntax-function. But what about completing the parse? Well, it's possible to determine whether any particular buffer state is being parsed "dynamically" (the GSS is split) or "statically" (the GSS has one head). In practice, ambiguous regions are short, so we only have to worry about the latter case. Here, we can reliably partially parse the buffer by parsing forward (remember, GLR parsing is an online algorithm) until the parsing limit is after the visible portion of the buffer; we can then use a "shortest-to-completion" production (which we can pre-compute from the grammar) to insert virtual tokens that complete the language. If we were parsing C++, we'd be able to automatically insert virtual "}" tokens and complete the parse tree, giving us a partial parse tree we can then use for fontification. Of course, it's better to have a parse tree that covers the whole buffer, but we can compute that in the background. One really nice property of GLR parsing (and LR parsers generally) is that we can trivially pause and restart them, so we can parse in the background with very little impact on user activity, using `while-no-input' to back out to the event loop when we detect we have something more important to do. Error recovery is a challenge --- we want to do something useful on buffers that are temporarily invalided by user edits. One of the biggest challenges is dealing with mismatched brace constructs, since one missing "}" in C-mode can cause the entire rest of the buffer to no longer conform to the grammar. Bridge parsing is particularly interesting here: the basic idea is that we perform one pass over the buffer using a simplified grammar that looks only at nesting constructs, insert virtual braces to make everything balance, and then apply the usual parsing algorithm (with ad-hoc error recovery) to the result. I haven't tried it, but it apparently works pretty well. I like what SMIE does, but I don't think it's powerful enough to parse C++. GLR parsers give us all the power we need. GLR parsers also leave us with a parse forest instead of a parse tree, and we really need a parse *tree* if we're going to apply fontification rules specified in terms of AST nodes. We need to prune the parse forest down to a parse tree before using it for fontification. I actually have a pretty neat idea for doing that: we can attach assertions to parts of the parse forest and prune the tree by deleting alternatives that contradict what we know. cc-mode already does a very limited form of this analysis by recording types it sees during parsing. Consider the most vexing parse: type variable(function()) We don't know whether that's a function declaration or a variable declaration. In GLR-land, we generate a parse forest with branches for both options. If later in the buffer we see something that looks like thi= s: variable->foo(); That we know that "variable" cannot be a function, so we can prune the part of the parse forest that assumes that "variable" is a function, leaving us with an unambiguous parse tree. This way, we should be able to come up with reasonable parses for C++ buffers without having an IDE-style global view of types declarations, instead assuming that programmers, most of the time, don't contradict themselves. [1] harmonia.cs.berkeley.edu/papers/twagner-thesis.pdf [2] fileadmin.cs.lth.se/cs/Personal/Emma_Soderberg/docs/SLE08pres.pdf --nVTQlbRT3x4DOB3lEISAOfxOf91B17hnO Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAEBCAAGBQJUH0rQAAoJEN4WImmbpWBlvA4P/3+5UgJNxhP0i4vI+ItxVvyF GVzHzvUmcCS/68dWMCfstMJuBTexbaiwBwa5KG3TH1mj1i9w5XyOaTjHrZ6UxrCz xr/xWfoS0Szk6vHNKHZxeWHkQVcAERJOUowx4bvYukSjcsFF9/nNJhqssZqOXUMc UeyI5tfESwKfafxMxDR5g4j0qojGypiu3dF7eMRrkGP17mYSsQBy6/87/9VjdG7x +RFrdf1QIo1KPzn4LMENJr5ST3hxDSVxFHeqM0TRKuGuDb0iZS/w6Y25zPKx0G+3 0pKCUUF6uUwpf2TgN/6WJwnYohfxxwQVXvkhkx+0bZJsoATWIvwns8ZxiQKlGQk6 /152JmzhvRejPa7WMJIIZZnk1YDkxJ/QJu/3nYolgGF7HmprDhSAp5uxSydmUHiR F/3gn35ysDPm6Otfy8enKEK1IoaBslfR1MBBdCC4iziVZhIZhGX0Ug60TgiJK9bu 9qfVS0TpgnU2PRzivQ2kdwmWoZuR1L8XLPVEP0ivW00BdN1kBupMiH6kbpxgLIDJ /Z7e55Y8vPnvmwyxsQjPkipp48GKJMCUYuqSx/n+NhFgzz87nsQJSifbzrbI1RZt Qq4VS6EzYgXawGAFrmrYJoq0Y5lgZUzj5i/eBSWMoK5if1IRZHYefp+/+dvf1fcy vuO/4IeDUtElrTOk+Nbt =PetC -----END PGP SIGNATURE----- --nVTQlbRT3x4DOB3lEISAOfxOf91B17hnO--