From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: Stefan Monnier Newsgroups: gmane.emacs.help Subject: Re: State-machine based syntax highlighting Date: Fri, 08 Dec 2006 21:06:11 -0500 Message-ID: References: <1165472049.496117.320630@79g2000cws.googlegroups.com> <1165488825.132862.189340@79g2000cws.googlegroups.com> <1165492567.864982.59980@79g2000cws.googlegroups.com> <1165495364.560960.271250@f1g2000cwa.googlegroups.com> <1165501630.172348.157180@j72g2000cwa.googlegroups.com> <1165502373.932709.15860@79g2000cws.googlegroups.com> <1165510932.276718.251220@73g2000cwn.googlegroups.com> <1165516558.657188.21610@j44g2000cwa.googlegroups.com> <1165517838.526624.171950@f1g2000cwa.googlegroups.com> <1165563183.140436.326110@16g2000cwy.googlegroups.com> <87psaukdu1.fsf@lion.rapttech.com.au> <1165567010.036350.185800@n67g2000cwd.googlegroups.com> <1165594639.164156.286660@f1g2000cwa.googlegroups.com> NNTP-Posting-Host: dough.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 8bit X-Trace: sea.gmane.org 1165632037 5942 80.91.229.10 (9 Dec 2006 02:40:37 GMT) X-Complaints-To: usenet@sea.gmane.org NNTP-Posting-Date: Sat, 9 Dec 2006 02:40:37 +0000 (UTC) Original-X-From: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org Sat Dec 09 03:40:37 2006 Return-path: Envelope-to: geh-help-gnu-emacs@m.gmane.org Original-Received: from lists.gnu.org ([199.232.76.165]) by dough.gmane.org with esmtp (Exim 4.50) id 1Gss8h-0005gS-C7 for geh-help-gnu-emacs@m.gmane.org; Sat, 09 Dec 2006 03:40:35 +0100 Original-Received: from localhost ([127.0.0.1] helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1Gss8g-0000ZL-QK for geh-help-gnu-emacs@m.gmane.org; Fri, 08 Dec 2006 21:40:34 -0500 Original-Path: shelby.stanford.edu!newsfeed.stanford.edu!postnews.google.com!news4.google.com!border1.nntp.dca.giganews.com!nntp.giganews.com!local01.nntp.dca.giganews.com!nntp.umontreal.ca!news.umontreal.ca.POSTED!not-for-mail Original-NNTP-Posting-Date: Fri, 08 Dec 2006 20:06:19 -0600 Original-Newsgroups: gnu.emacs.help User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.0.91 (gnu/linux) Cancel-Lock: sha1:wySlTIwbbDy0rE8Ax08IJGnmYP4= Original-Lines: 83 Original-NNTP-Posting-Host: 132.204.27.213 Original-X-Trace: sv3-wNz0Jw8fWP0JAiw3cdVsBi8qewwupT6JRaJK2Ge+LCB/GWYqkEjQENMPLZV16zjdefG1FUEKOI5Nqh8!KaxwIvaWqNv2iivA+F5qR5ILdwZB5gdcO82KmEUGv5fuJvODiA5KiLW7IPowksHCDFXlos5X+/dM!jdYviVR2DICpIjoTcg== Original-X-Complaints-To: abuse@umontreal.ca X-DMCA-Complaints-To: abuse@umontreal.ca X-Abuse-and-DMCA-Info: Please be sure to forward a copy of ALL headers X-Abuse-and-DMCA-Info: Otherwise we will be unable to process your complaint properly X-Postfilter: 1.3.32 Original-Xref: shelby.stanford.edu gnu.emacs.help:143763 Original-To: help-gnu-emacs@gnu.org X-BeenThere: help-gnu-emacs@gnu.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Users list for the GNU Emacs text editor List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Original-Sender: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org Errors-To: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org Xref: news.gmane.org gmane.emacs.help:39366 Archived-At: > A much better strategy is to start parsing at point in each file and only > parse a screenful at a time, doing this with an external parser would be > very hard. Starting the parse "at point" can be terribly difficult since you don't know the state of the parser at that point. You can infer the state by parsing backward (this is what the indentation code does typically), or by jumping to some previous spot assumed to have some known parsing state and then parse forward. If all you care about is indentation, then parsing backward gives the best results in terms of being robust in the face of partially incorrect code (because it only parses just as far back as necessary to determine indentation, so the resulting indentation behavior has some kind of locality quality to it). If you really need the full state because you're going to keep parsing forward some arbitrary distance, then you're better off jumping back to a "safe spot" and parsing forward from there. In some languages it's not so easy to figure out what are such safe spots other than the beginning of the file. But maybe with enough parse-state caching (à la syntax-ppss in Emacs-22, although beefed up to keep the state of a real parser), and with a fast enough forward parsing code, you can get away with always parsing "from the beginning of the buffer", although it then suffers from problems when faced with invalid/misunderstood source code. Of note: parsing backward can be fiendishly difficult because languages are designed without paying any attention to it. > There are other problems. What if a part of the code is incorrect? > Imagine, in C for example, if a function were written "foo (;" on line > 10. The effect of the error would propagate down far away from where > it occurs, even line 300 might be treated wrongly. The parser would > have to cope with this eventuality. What if it's correct, but only after passing through some special purpose preprocessor? > Also, in many languages there are bits of the meaning that depend on > the names used. In C for example the code " (foo) (bar)" means > something different if foo is a type than it does if it's an > identifier. The C compiler can cope with this because it tracks all > typedefs and identifiers through not only the current file but those > included in it with #include. The only way for a font-lock system > based on a normal parser to deal with this situation would be for it to > read all the include files, which may not even be present. > Compiler parsers and font-locking/navigating code have different > intentions. Compiler parsers must be fast when handling a whole file, > and they must generate accurate error messages. Font-locking code must > be fast when starting at any arbitrary part of the code, and it must > tolerate incomplete information and errors. Note that this last point can be seen as an advantage: we don't have to detect invalid code. I believe there are two alternatives: one is the way taken by things like Visual Haskell where you integrate the editor and the build system, so the editor can run the parser with the exact same args as the compiler would/will. I think this is a very workable solution and I hope it will be developped in Emacs as well. If you decide not to integrate the editor so closely with the build system (i.e. follow the way Emacs currently works), then you really can't reliably parse the buffer and thus can't reuse existing parsers. So you end up having to design a new parser for every language. For a real programming language, writing its grammar is a non-trivial task, so it can be a problem. The only thing that would save us is that we can be as permissive as we want: we don't have to reject invalid programs. Better: we can presume that the code is valid, and if it isn't we can anything we please. I hope Emacs will also develop in this direction. One good step in that direction would be to spice up syntax-table so that the basic syntactic elements can be bigger than a single-char (e.g. so as to handle begin...end). I've recently experimented with the use of an infix-precedence system where each infix "operator" can have a different left and right precedence, and then to try and use that to parse things like if/then/else (where `then' and `else' are seen as "infix" operators). It doesn't seem quite powerful enough for what I want (to indent Coq code in this case), but sufficiently close that maybe some minor extension will get me there. Stefan