From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Stephen Leake <stephen_leake@member.fsf.org>
Newsgroups: gmane.emacs.devel
Subject: Re: syntax-propertize-function vs indentation lexer
Date: Fri, 31 May 2013 03:45:07 -0400
Message-ID: <85k3mf8uf0.fsf@member.fsf.org>
References: <85mwrdbypv.fsf@member.fsf.org>
	<jwvfvx5u17h.fsf-monnier+emacs@gnu.org>	<85bo7sbzhh.fsf@member.fsf.org>
	<jwvwqqg7fna.fsf-monnier+emacs@gnu.org>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain
X-Trace: ger.gmane.org 1369986327 27989 80.91.229.3 (31 May 2013 07:45:27 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Fri, 31 May 2013 07:45:27 +0000 (UTC)
To: emacs-devel@gnu.org
In-reply-to: <jwvwqqg7fna.fsf-monnier+emacs@gnu.org>
	(Stefan Monnier's message	of "Thu, 30 May 2013 10:02:59 -0400")
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3 (windows-nt)
Xref: news.gmane.org gmane.emacs.devel:159947
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/159947>

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> The doc string for syntax-propertize-function only mentions
>> font-lock, not indentation etc; it should say "most syntax uses", or
>> better, list all the places it is called. 
>
> Oops, indeed it singles out font-lock.  I just installed the patch below
> which should address this problem.

Looks good, thanks.

>> The latter; I'm parsing the entire buffer with an LALR parser in
>> ada-mode, and whenever it changes,
>
> Sounds expensive.  How does it cope with large buffers?

Not clear yet - I'm still getting the Ada grammar right. 

The parser is actually generalized LALR, which spawns parallel parsers
for grammar conflicts and ambiguities. So it can be very slow when the
grammar has too many conflicts or is ambiguous - running 64 parsers in
parallel is a lot slower than running 1 :). But it works well when the
conflict can be resolved in a few tokens, and is much easier than
reconstructing the grammar to eliminate the conflict.

>> and caching the results for use by indent. So far it's quite fast.
>
> How much time does it take to open a 1MB file?

I've never seen a 1MB Ada source code file.

Such a file would never be accepted in any project I have worked on, in
any source language, unless it was generated from some other source. In
which case it should not be edited by hand, and should only be read
rarely.

So I don't think that's a realistic use case, and there is a reasonable
limit to file size.

Of course, it should be possible to open such a file in any case, so
perhaps I'll need an explicit limit to disable parsing on large files.
But any discussion of parser speed is premature at this point.
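
A minimal sketch of such a limit, assuming a simple character-count
check. The names my-parse-size-limit and my-parse-allowed-p are
hypothetical, and the default value is an arbitrary placeholder, not a
measured threshold:

```elisp
;; Hypothetical names; the threshold is a placeholder, not a measurement.
(defvar my-parse-size-limit 500000
  "Buffers larger than this many characters are not parsed.")

(defun my-parse-allowed-p ()
  "Return non-nil if the current buffer is small enough to parse."
  (<= (buffer-size) my-parse-size-limit))
```

The parser entry point would check my-parse-allowed-p first and fall
back to no indentation support when it returns nil.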

>> So I need to call
>> (syntax-propertize (point-max))
>> in ada-mode
>
> I wouldn't put it in ada-mode, no.  Instead, I'd put it closer to the
> code that actually needs those properties to be applied. E.g. I'd
> either put it in the LALR parser code (if that code needs the syntax
> properties) or in the indentation code.  

There may be other code, completely independent of the parser, that
relies on syntax; imenu, for example.

I'm also using the cached data for navigation (moving from 'if' to
'then' to 'elsif' to 'end if' etc); that is logically independent of
indentation (but not of the parser, of course).

> Note that calling
> syntax-propertize repeatedly is cheap: if the region has already been
> handled, it returns almost instantly since it begins with
>
>   (when (and syntax-propertize-function
>              (< syntax-propertize--done pos))

Yes, that does help.
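
One way to see the effect of that guard is to count how often the
propertize function actually runs. This is only an illustration; the
counter and the stand-in function my-counting-propertize are
hypothetical, not part of syntax.el or ada-mode:

```elisp
(require 'syntax)

(defvar my-propertize-calls 0
  "Number of times `my-counting-propertize' has run (illustration only).")

(defun my-counting-propertize (_start _end)
  "Stand-in for a real `syntax-propertize-function'; just counts calls."
  (setq my-propertize-calls (1+ my-propertize-calls)))

(defun my-propertize-twice ()
  "Return (FIRST . SECOND): the call counts after each of two requests."
  (with-temp-buffer
    (insert "procedure Foo is begin null; end Foo;\n")
    (setq-local syntax-propertize-function #'my-counting-propertize)
    (setq my-propertize-calls 0)
    ;; First request: the region has not been handled yet.
    (syntax-propertize (point-max))
    (let ((after-first my-propertize-calls))
      ;; Second request for the same region: the guard should skip it.
      (syntax-propertize (point-max))
      (cons after-first my-propertize-calls))))
```

If the guard behaves as described, the two counts in the returned pair
are equal, i.e. the second request does no work.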

> Also I probably wouldn't put (syntax-propertize (point-max)), but
> instead use (syntax-propertize end) where `end' is the end of the region
> being currently LALR-parsed or being considered by the indentation
> code.

I considered that. 

Since the parser runs independently of indentation, the call would have
to go in the parser (actually lexer) code. wisi-forward-token would be a
logical place. But what would be the right guess for 'end'? The first
step in wisi-forward-token is forward-comment, which can skip quite large
portions of the buffer.

LALR always parses an entire top-level grammar structure. For Ada files,
that is the whole file, for all the coding standards I'm aware of. The
language itself allows for more than one per file, but doing that messes
up dependency analysis, and prevents minimal recompilation.

So the only reasonable guess for 'end', for Ada, is point-max. There may
be other reasonable guesses for other languages, so a language-specific
hook might be a good choice. 
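
A rough sketch of what such a hook could look like, assuming a
buffer-local variable that each language mode may override. Both names
below are hypothetical; nothing like this exists in wisi or syntax.el as
far as this message states:

```elisp
(require 'syntax)

(defvar-local my-lexer-propertize-end-function #'point-max
  "Function of no arguments returning how far to run `syntax-propertize'.
Hypothetical hook; a language mode could set it to something cheaper.")

(defun my-lexer-ensure-propertized ()
  "Apply syntax properties up to the mode's guess for the parse end.
Return the position that was used."
  (let ((end (funcall my-lexer-propertize-end-function)))
    (syntax-propertize end)
    end))
```

With the default value this is just (syntax-propertize (point-max)),
which matches the Ada case; a mode for another language would set the
variable to its own guess.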

C++, for example, often has multiple classes per file; each class would
be a top-level grammar structure. But scanning for the end of the class
logically requires running syntax-propertize (maybe not actually for
C++, but some language might require that), so we've got a real problem.
(C++ also requires running the macro preprocessor before any scanning,
which is very expensive.) I'm not worrying about that right now;
(point-max) is a cheap and always-correct answer.

>> (syntax-ppss-flush-cache begin) 
>> (syntax-propertize end)
>> in the after-change hook.
>
> You might want to put the syntax-ppss-flush-cache there (although
> syntax.el should already take care of that, normally), 

How does syntax.el take care of this? The only function on
after-change-functions by default is jit-lock-after-change. And that's
only there if font-lock is on.

I have been implicitly assuming syntax-ppss is correct after a text
change, but I never investigated how that worked.

> but the
> syntax-propertize doesn't belong there either (since it belongs to the
> code that actually uses those properties, i.e. either the parser or the
> indentation).

Syntax properties are closely tied to the text (they are an extension of
the syntax table), and used by several independent functions, and thus
should be kept consistent with the text as much as possible. So
syntax-propertize should be run whenever the text changes.
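
For concreteness, the after-change approach described above amounts to
something like the following sketch. The function names are
hypothetical; only syntax-ppss-flush-cache, syntax-propertize, and
after-change-functions are real:

```elisp
(require 'syntax)

(defun my-mode-after-change (begin end _length)
  "Keep syntax properties in step with the text after an edit.
Hypothetical function, intended for `after-change-functions'."
  ;; Discard cached parse state from the start of the change onward.
  (syntax-ppss-flush-cache begin)
  ;; Re-apply syntax properties up to the end of the changed text.
  (syntax-propertize end))

(defun my-mode-install-after-change ()
  "Install `my-mode-after-change' buffer-locally."
  (add-hook 'after-change-functions #'my-mode-after-change nil t))
```

The mode function would call my-mode-install-after-change once per
buffer.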

The same could be said for the cached parse results; the parser should
also be run from after-change-functions. I'm not going that
far (yet) because I'm still debugging the parser, and don't want it
called automatically too early. But that may be the right move
eventually, to support imenu etc.


Another design choice would be to have all the low-level functions that
rely on syntax (forward-comment, forward-word, etc) call
syntax-propertize. That would certainly be more transparent, and is
consistent with what you are advocating. But that runs into the
'reasonable guess for end' problem; I think the language mode is the
best place to resolve that problem. A language hook to provide the guess
would be reasonable, but that hook could be expensive (since it reduces
parser time, which is even more expensive), and thus should not be
called more often than necessary (certainly not for every call of
forward-comment).


I think you are actually advocating a third choice: any code that
depends on low-level syntax functions must be aware of
syntax-propertize, and call it appropriately. That makes sense.

It would help if the doc string for parse-partial-sexp mentioned
syntax-propertize and syntax-ppss-flush-cache; then I would have been
aware of this issue sooner.
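
As an illustration of that third choice, a caller of parse-partial-sexp
would make sure the syntax properties are in place first. The helper
name my-in-string-p is hypothetical, and scanning from point-min is just
the simplest choice for a sketch:

```elisp
(require 'syntax)

(defun my-in-string-p (pos)
  "Return non-nil if POS is inside a string.
Hypothetical helper: applies syntax properties up to POS first, then
scans from the start of the buffer with `parse-partial-sexp'."
  (save-excursion
    ;; Make sure syntax-table text properties exist before scanning.
    (syntax-propertize pos)
    ;; Element 3 of the parse state is non-nil inside a string.
    (nth 3 (parse-partial-sexp (point-min) pos))))
```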

-- 
-- Stephe