syntax-propertize-function vs indentation lexer

unofficial mirror of emacs-devel@gnu.org 

* syntax-propertize-function vs indentation lexer
@ 2013-05-29 15:19 Stephen Leake
  2013-05-29 15:28 ` Dmitry Gutov
  2013-05-29 17:52 ` Stefan Monnier
  0 siblings, 2 replies; 9+ messages in thread
From: Stephen Leake @ 2013-05-29 15:19 UTC
  To: emacs-devel

I'm working on a new indentation mode for Ada (and have been for a while
- see http://stephe-leake.org/emacs/ada-mode/emacs-ada-mode.html).

I've recently run across an issue with syntax-propertize-function.

I use that to set the syntax of Ada character constants (e.g. 'A') to
'string'. This matters for the particular constant '\', and for some
other cases.

The indentation engine relies on this, because it uses a lexer that uses
syntax properties.

However, syntax-propertize-function is only called from font-lock. So it
only runs on the visible part of the buffer, and only when font-lock is
enabled. So if either of those conditions is not met, the lexer fails
on character constants. In particular, I have a test suite that runs in
batch mode, when global-font-lock-mode is off.

To fix this, I could call syntax-propertize-function (or
ada-syntax-propertize directly) from ada-mode to initialize the buffer,
and again from before- or after-change-functions to catch edits.

Is there a better way?

-- 
-- Stephe
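
A minimal sketch of the kind of `syntax-propertize-function' described above; it is not the actual ada-mode code. The name `my-ada-syntax-propertize' and the regexp are assumptions for illustration. The regexp treats a quote as opening a character constant only when the preceding character is neither alphanumeric nor `)', so that attribute ticks such as Foo'Access are left alone.

```elisp
(defun my-ada-syntax-propertize (start end)
  "Mark the quotes of Ada character constants between START and END.
Hypothetical example: both quotes of a constant like \\='A\\=' get
string-quote syntax, so the lexer sees the constant as one string."
  (goto-char start)
  (funcall
   (syntax-propertize-rules
    ;; A quote, any single character, and a closing quote, not
    ;; directly preceded by an identifier character or a close paren.
    ("[^[:alnum:])]\\('\\).\\('\\)"
     (1 "\"")
     (2 "\"")))
   start end))

;; In the major mode function one would then write something like:
;;   (setq-local syntax-propertize-function #'my-ada-syntax-propertize)
```

When font-lock is off, as in a batch test run, nothing applies these properties until some code calls `syntax-propertize' on the region it is about to scan.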


* Re: syntax-propertize-function vs indentation lexer
  2013-05-29 15:19 syntax-propertize-function vs indentation lexer Stephen Leake
@ 2013-05-29 15:28 ` Dmitry Gutov
  2013-05-29 17:52 ` Stefan Monnier
  1 sibling, 0 replies; 9+ messages in thread
From: Dmitry Gutov @ 2013-05-29 15:28 UTC
  To: Stephen Leake; +Cc: emacs-devel

Stephen Leake <stephen_leake@member.fsf.org> writes:

> However, syntax-propertize-function is only called from font-lock. So it
> only runs on the visible part of the buffer, and only when font-lock is
> enabled. So if either of those conditions is not met, the lexer fails
> on character constants. In particular, I have a test suite that runs in
> batch mode, when global-font-lock-mode is off.
>
> To fix this, I could call syntax-propertize-function (or
> ada-syntax-propertize directly) from ada-mode to initialize the buffer,
> and again from before- or after-change-functions to catch edits.
>
> Is there a better way?

You can call `syntax-propertize', like `js-mode' does in its major mode
function.

It's usually called by `syntax-ppss', which is used in indentation code
in most of the packages.
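
A small sketch of that second point, assuming a helper name (`my-inside-string-p') that is not from any package. Because `syntax-ppss' calls `syntax-propertize' up to the position it is asked about, code that goes through it picks up the mode's syntax properties without font-lock being involved.

```elisp
(defun my-inside-string-p (&optional pos)
  "Return non-nil if POS (default point) is inside a string-like construct.
Hypothetical helper.  `syntax-ppss' propertizes the buffer up to POS
before parsing, so text marked by the mode's
`syntax-propertize-function' is taken into account."
  (nth 3 (syntax-ppss pos)))
```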




* Re: syntax-propertize-function vs indentation lexer
  2013-05-29 15:19 syntax-propertize-function vs indentation lexer Stephen Leake
  2013-05-29 15:28 ` Dmitry Gutov
@ 2013-05-29 17:52 ` Stefan Monnier
  2013-05-30  9:15   ` Stephen Leake
  1 sibling, 1 reply; 9+ messages in thread
From: Stefan Monnier @ 2013-05-29 17:52 UTC
  To: Stephen Leake; +Cc: emacs-devel

> However, syntax-propertize-function is only called from font-lock. So it
> only runs on the visible part of the buffer, and only when font-lock is
> enabled. So if either of those conditions is not met, the lexer fails
> on character constants. In particular, I have a test suite that runs in
> batch mode, when global-font-lock-mode is off.

"syntax propertization" is done lazily, so if/when you need it, you need
to call `syntax-propertize'.  In many cases it's done for you
(e.g. indent-according-to-mode makes sure it's propertized until the end
of the current line), but not in all cases.

Not sure why propertization isn't done for you in the case
of indentation.  Maybe because you sometimes look after the current line?
Or because your indentation doesn't go through indent-according-to-mode?

        Stefan


* Re: syntax-propertize-function vs indentation lexer
  2013-05-29 17:52 ` Stefan Monnier
@ 2013-05-30  9:15   ` Stephen Leake
  2013-05-30 14:02     ` Stefan Monnier
  0 siblings, 1 reply; 9+ messages in thread
From: Stephen Leake @ 2013-05-30  9:15 UTC
  To: emacs-devel

Stefan Monnier <monnier@IRO.UMontreal.CA> writes:

>> However, syntax-propertize-function is only called from font-lock. So it
>> only runs on the visible part of the buffer, and only when font-lock is
>> enabled. So if either of those conditions is not met, the lexer fails
>> on character constants. In particular, I have a test suite that runs in
>> batch mode, when global-font-lock-mode is off.
>
> "syntax propertization" is done lazily, so if/when you need it, you need
> to call `syntax-propertize'.  In many cases it's done for you
> (e.g. indent-according-to-mode makes sure it's propertized until the end
> of the current line), but not in all cases.

The doc string for syntax-propertize-function only mentions
font-lock, not indentation etc; it should say "most syntax uses", or
better, list all the places it is called. 

> Not sure why propertization isn't done for you in the case
> of indentation.  Maybe because you sometimes look after the current line?
> Or because your indentation doesn't go through
> indent-according-to-mode?

The latter; I'm parsing the entire buffer with an LALR parser in
ada-mode, and whenever it changes, and caching the results for use by
indent. So far it's quite fast.

So I need to call 

(syntax-propertize (point-max)) 

in ada-mode and 

(syntax-ppss-flush-cache begin) 
(syntax-propertize end)

in the after-change hook.

-- 
-- Stephe




* Re: syntax-propertize-function vs indentation lexer
  2013-05-30  9:15   ` Stephen Leake
@ 2013-05-30 14:02     ` Stefan Monnier
  2013-05-31  7:45       ` Stephen Leake
  0 siblings, 1 reply; 9+ messages in thread
From: Stefan Monnier @ 2013-05-30 14:02 UTC
  To: Stephen Leake; +Cc: emacs-devel

> The doc string for syntax-propertize-function only mentions
> font-lock, not indentation etc; it should say "most syntax uses", or
> better, list all the places it is called. 

Oops, indeed it singles out font-lock.  I just installed the patch below
which should address this problem.

> The latter; I'm parsing the entire buffer with an LALR parser in
> ada-mode, and whenever it changes,

Sounds expensive.  How does it cope with large buffers?

> and caching the results for use by indent. So far it's quite fast.

How much time does it take to open a 1MB file?

> So I need to call
> (syntax-propertize (point-max))
> in ada-mode

I wouldn't put it in ada-mode, no.  Instead, I'd put it closer to the
code that actually needs those properties to be applied.  E.g. I'd
either put it in the LALR parser code (if that code needs the syntax
properties) or in the indentation code.  Note that calling
syntax-propertize repeatedly is cheap: if the region has already been
handled, it returns almost instantly since it begins with

  (when (and syntax-propertize-function
             (< syntax-propertize--done pos))

Also I probably wouldn't put (syntax-propertize (point-max)), but
instead use (syntax-propertize end) where `end' is the end of the region
being currently LALR-parsed or being considered by the indentation code.
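
A minimal sketch of that placement, assuming a hypothetical lexer entry point named `my-lex-region'; it is not the wisi code. The call to `syntax-propertize' sits at the top of the function that consumes the properties, and takes the end of the region being lexed rather than point-max.

```elisp
(defun my-lex-region (end)
  "Return the tokens between point and END as a list of strings.
Hypothetical lexer, for illustration only."
  ;; Propertize right where the properties are consumed.  This returns
  ;; almost immediately if the text up to END has already been handled.
  (syntax-propertize end)
  (let (tokens)
    (while (progn (forward-comment (point-max)) ; skip whitespace and comments
                  (< (point) end))
      (let ((start (point)))
        (cond
         ;; String-like token (syntax class 7), including character
         ;; constants whose quotes were marked by syntax-propertize.
         ((eq (syntax-class (syntax-after (point))) 7)
          (forward-sexp 1))
         ;; Word or symbol; otherwise a single punctuation character.
         ((zerop (skip-syntax-forward "w_"))
          (forward-char 1)))
        (push (buffer-substring-no-properties start (point)) tokens)))
    (nreverse tokens)))
```

Calling it as (my-lex-region (point-max)) gives the whole-buffer behaviour, so the same function serves both cases.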

> (syntax-ppss-flush-cache begin) 
> (syntax-propertize end)
> in the after-change hook.

You might want to put the syntax-ppss-flush-cache there (although
syntax.el should already take care of that, normally), but the
syntax-propertize doesn't belong there either (since it belongs to the
code that actually uses those properties, i.e. either the parser or the
indentation).

        Stefan

=== modified file 'lisp/emacs-lisp/syntax.el'
--- lisp/emacs-lisp/syntax.el	2013-04-22 14:11:37 +0000
+++ lisp/emacs-lisp/syntax.el	2013-05-30 13:55:38 +0000
@@ -56,12 +56,13 @@
   ;; syntax-ppss-flush-cache since that would not only flush the cache but also
   ;; reset syntax-propertize--done which should not be done in this case).
   "Mode-specific function to apply `syntax-table' text properties.
-The value of this variable is a function to be called by Font
-Lock mode, prior to performing syntactic fontification on a
-stretch of text.  It is given two arguments, START and END: the
-start and end of the text to be fontified.  Major modes can
-specify a custom function to apply `syntax-table' properties to
-override the default syntax table in special cases.
+It is the work horse of `syntax-propertize', which is called by things like
+Font-Lock and indentation.
+
+It is given two arguments, START and END: the start and end of the text to
+which `syntax-table' might need to be applied.  Major modes can use this to
+override the buffer's syntax table for special syntactic constructs that
+cannot be handled just by the buffer's syntax-table.

 The specified function may call `syntax-ppss' on any position
 before END, but it should not call `syntax-ppss-flush-cache',


* Re: syntax-propertize-function vs indentation lexer
  2013-05-30 14:02     ` Stefan Monnier
@ 2013-05-31  7:45       ` Stephen Leake
  2013-05-31 13:23         ` Stefan Monnier
  0 siblings, 1 reply; 9+ messages in thread
From: Stephen Leake @ 2013-05-31  7:45 UTC
  To: emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> The doc string for syntax-propertize-function only mentions
>> font-lock, not indentation etc; it should say "most syntax uses", or
>> better, list all the places it is called. 
>
> Oops, indeed it singles out font-lock.  I just installed the patch below
> which should address this problem.

Looks good, thanks.

>> The latter; I'm parsing the entire buffer with an LALR parser in
>> ada-mode, and whenever it changes,
>
> Sounds expensive.  How does it cope with large buffers?

Not clear yet - I'm still getting the Ada grammar right. 

The parser is actually generalized LALR, which spawns parallel parsers
for grammar conflicts and ambiguities. So it can be very slow when the
grammar has too many conflicts or is ambiguous - running 64 parsers in
parallel is a lot slower than running 1 :). But it works well when the
conflict can be resolved in a few tokens, and is much easier than
reconstructing the grammar to eliminate the conflict.

>> and caching the results for use by indent. So far it's quite fast.
>
> How much time does it take to open a 1MB file?

I've never seen a 1MB Ada source code file.

Such a file would never be accepted in any project I have worked on, in
any source language, unless it was generated from some other source. In
which case it should not be edited by hand, and should only be read
rarely.

So I don't think that's a realistic use case, and there is a reasonable
limit to file size.

Of course, it should be possible to open such a file in any case, so
perhaps I'll need an explicit limit to disable parsing on large files.
But any discussion of parser speed is premature at this point.

>> So I need to call
>> (syntax-propertize (point-max))
>> in ada-mode
>
> I wouldn't put it in ada-mode, no.  Instead, I'd put it closer to the
> code that actually needs those properties to be applied. E.g. I'd
> either put it in the LALR parser code (if that code needs the syntax
> properties) or in the indentation code.  

There may be other code, completely independent of the parser, that
relies on syntax; imenu, for example.

I'm also using the cached data for navigation (moving from 'if' to
'then' to 'elsif' to 'end if' etc); that is logically independent of
indentation (but not of the parser, of course).

> Note that calling
> syntax-propertize repeatedly is cheap: if the region has already been
> handled, it returns almost instantly since it begins with
>
>   (when (and syntax-propertize-function
>              (< syntax-propertize--done pos))

yes, that does help. 

> Also I probably wouldn't put (syntax-propertize (point-max)), but
> instead use (syntax-propertize end) where `end' is the end of the region
> being currently LALR-parsed or being considered by the indentation
> code.

I considered that. 

Since the parser is asynchronous from the indentation, it would have to
go in the parser (actually lexer) code. wisi-forward-token would be a
logical place. But what would be the right guess for 'end'? The first
step in wisi-forward-token is forward-comment, which can skip quite large
portions of the buffer. 

LALR always parses an entire top-level grammar structure. For Ada files,
that is the whole file, for all the coding standards I'm aware of. The
language itself allows for more than one per file, but doing that messes
up dependency analysis, and prevents minimal recompilation.

So the only reasonable guess for 'end', for Ada, is point-max. There may
be other reasonable guesses for other languages, so a language-specific
hook might be a good choice. 

C++, for example, often has multiple classes per file; each class would
be a top-level grammar structure. But scanning for the end of the class
logically requires running syntax-propertize (maybe not actually for
C++, but some language might require that), so we've got a real problem.
(C++ does require running the macro preprocessor before any scanning,
which is very expensive). I'm not worrying about that right now, but
(point-max) is a cheap and always correct answer.

>> (syntax-ppss-flush-cache begin) 
>> (syntax-propertize end)
>> in the after-change hook.
>
> You might want to put the syntax-ppss-flush-cache there (although
> syntax.el should already take care of that, normally), 

How does syntax.el take care of this? The only function on
after-change-functions by default is jit-lock-after-change. And that's
only there if font-lock is on.

I have been implicitly assuming syntax-ppss is correct after a text
change, but I never investigated how that worked.

> but the
> syntax-propertize doesn't belong there either (since it belongs to the
> code that actually uses those properties, i.e. either the parser or the
> indentation).

Syntax properties are closely tied to the text (they are an extension of
the syntax table), and used by several independent functions, and thus
should be kept consistent with the text as much as possible. So
syntax-propertize should be run whenever the text changes.

The same could be said for the cached parse results; the parser should
also be run from after-change-functions. I'm not going that
far (yet) because I'm still debugging the parser, and don't want it
called automatically too early. But that may be the right move
eventually, to support imenu etc.

Another design choice would be to have all the low-level functions that
rely on syntax (forward-comment, forward-word, etc) call
syntax-propertize. That would certainly be more transparent, and is
consistent with what you are advocating. But that runs into the
'reasonable guess for end' problem; I think the language mode is the
best place to resolve that problem. A language hook to provide the guess
would be reasonable, but that hook could be expensive (since it reduces
parser time, which is even more expensive), and thus should not be
called more often than necessary (certainly not for every call of
forward-comment).

I think you are actually advocating for a third choice; any code that
depends on low-level syntax functions must be aware of
syntax-propertize, and call it appropriately. That makes sense.

It would help if the doc string for parse-partial-sexp mentioned
syntax-propertize and syntax-ppss-flush-cache; then I would have been
aware of this issue sooner.

-- 
-- Stephe


* Re: syntax-propertize-function vs indentation lexer
  2013-05-31  7:45       ` Stephen Leake
@ 2013-05-31 13:23         ` Stefan Monnier
  2013-06-01  5:19           ` Stephen Leake
  0 siblings, 1 reply; 9+ messages in thread
From: Stefan Monnier @ 2013-05-31 13:23 UTC
  To: Stephen Leake; +Cc: emacs-devel

>> Sounds expensive.  How does it cope with large buffers?

> Not clear yet - I'm still getting the Ada grammar right. 

> The parser is actually generalized LALR, which spawns parallel parsers
> for grammar conflicts and ambiguities. So it can be very slow when the
> grammar has too many conflicts or is ambiguous - running 64 parsers in
> parallel is a lot slower than running 1 :). But it works well when the
> conflict can be resolved in a few tokens, and is much easier than
> reconstructing the grammar to eliminate the conflict.

Aha!  So the parser is a separate executable written in some other
language than Elisp, I guess?  In that case parsing speed should not be
a serious concern (except for the possible explosion of parallel parsers).

Although, using it for indentation makes speed a real concern:
in many cases one does "edit+reindent", so if you put a "full reparse"
between the two, it needs to be about as fast as instantaneous.

> Such a file would never be accepted in any project I have worked on, in
> any source language, unless it was generated from some other source.

I agree that 1MB is very unusual, but emacs/src/xdisp.c is pretty damn
close to 1MB.  And several times I've seen files of several hundred KB.

So I agree that "fast enough at 1MB" implies "fast enough", but if speed
is a problem for 1MB, then it might also be a problem for a real file on
a real machine (maybe also because that machine is slower than yours).

> Of course, it should be possible to open such a file in any case, so
> perhaps I'll need an explicit limit to disable parsing on large files.

Making sure C-g can be used should be sufficient.

> But any discussion of parser speed is premature at this point.

Fair enough.

>>> So I need to call
>>> (syntax-propertize (point-max))
>>> in ada-mode
>> 
>> I wouldn't put it in ada-mode, no.  Instead, I'd put it closer to the
>> code that actually needs those properties to be applied. E.g. I'd
>> either put it in the LALR parser code (if that code needs the syntax
>> properties) or in the indentation code.
> There may be other code, completely independent of the parser, that
> relies on syntax; imenu, for example.

Then imenu should call syntax-propertize as well.

> I'm also using the cached data for navigation (moving from 'if' to
> 'then' to 'elsif' to 'end if' etc); that is logically independent of
> indentation (but not of the parser, of course).

Then navigation should also call syntax-propertize (indeed smie's sexp
navigation also calls syntax-propertize for the same reason).

> Since the parser is asynchronous from the indentation, it would have to
> go in the parser (actually lexer) code. wisi-forward-token would be a
> logical place. But what would be the right guess for 'end'? The first
> step in wisi-forward-token is forward-comment, which can skip quite large
> portions of the buffer. 

I have the same problem in SMIE navigation, indeed.  For backward
navigation, that's not a problem, but for forward navigation, I don't
have a good answer.  Luckily, SMIE mostly cares about backward
navigation since that's what's needed for indentation, but currently
forward navigation can bump into parse bugs because syntax-propertize
was not called on the text being considered.

> How does syntax.el take care of this? The only function on
> after-change-functions by default is jit-lock-after-change. And that's
> only there if font-lock is on.

It's added to before-change-functions.
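
A small demonstration of the effect, using a hypothetical function name. It makes no claim about exactly where syntax.el installs the hook (that has varied between Emacs versions); it only checks that `syntax-ppss' gives an up-to-date answer after an edit, with no explicit call to `syntax-ppss-flush-cache'.

```elisp
(defun my-demo-ppss-after-edit ()
  "Return a list (BEFORE AFTER), for illustration only.
BEFORE and AFTER say whether the end of a scratch buffer is inside a
string before and after a double quote is inserted at its start."
  (with-temp-buffer
    (insert "foo bar")
    (let ((before (nth 3 (syntax-ppss (point-max)))))
      ;; Open a string at the start of the buffer.  No explicit cache
      ;; flush is performed here.
      (goto-char (point-min))
      (insert "\"")
      (list (and before t)
            (and (nth 3 (syntax-ppss (point-max))) t)))))
```

If the cache were stale, the second element would still be nil, because the cached state just before the end of the buffer would be reused.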

> Syntax properties are closely tied to the text (they are an extension of
> the syntax table), and used by several independent functions, and thus
> should be kept consistent with the text as much as possible. So
> syntax-propertize should be run whenever the text changes.

That's doing the work eagerly, whereas syntax-propertize is designed to
do the work lazily.  In your case, since you very often need to look at
syntax properties and often need them to be correct up to
point-max, laziness probably doesn't buy you much.  But you may suffer
from performance problems.

> Another design choice would be to have all the low-level functions that
> rely on syntax (forward-comment, forward-word, etc) call
> syntax-propertize. That would certainly be more transparent, and is
> consistent with what you are advocating.

Right, that's the direction I'm headed.

> But that runs into the 'reasonable guess for end' problem; I think the
> language mode is the best place to resolve that problem. A language
> hook to provide the guess would be reasonable, but that hook could be
> expensive (since it reduces parser time, which is even more
> expensive), and thus should not be called more often than necessary
> (certainly not for every call of forward-comment).

I think a better option for forward navigation is to parse "up to 1KB
ahead" and constantly check whether we reached that parse limit to push
it 1KB further.
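
One way this idea might look, as a rough sketch only; neither SMIE nor wisi is claimed to work this way. The names `my-propertize-chunk' and `my-forward-comment-propertized' are hypothetical. The helper propertizes one chunk ahead, skips comments, and, if the skip ran past the propertized text, pushes the limit further and rescans from the starting position.

```elisp
(defvar my-propertize-chunk 1024
  "Hypothetical look-ahead distance, in characters.")

(defun my-forward-comment-propertized ()
  "Skip whitespace and comments forward, propertizing ahead in chunks.
Hypothetical helper, for illustration only.  Return the new point."
  (let ((start (point))
        (limit (min (point-max) (+ (point) my-propertize-chunk))))
    (syntax-propertize limit)
    (forward-comment (point-max))
    ;; If we ran past the propertized text, the skip may have been based
    ;; on missing properties: push the limit further and rescan.
    (while (and (> (point) limit) (< limit (point-max)))
      (setq limit (min (point-max) (+ (point) my-propertize-chunk)))
      (syntax-propertize limit)
      (goto-char start)
      (forward-comment (point-max)))
    (point)))
```

Rescanning from the start is the simplest correct choice here, at the cost of repeated work on very long comment runs; the limit grows on every iteration, so the loop ends at point-max at the latest.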

> I think you are actually advocating for a third choice; any code that
> depends on low-level syntax functions must be aware of
> syntax-propertize, and call it appropriately. That makes sense.

Right.

> It would help if the doc string for parse-partial-sexp mentioned
> syntax-propertize and syntax-ppss-flush-cache; then I would have been
> aware of this issue sooner.

I consider parse-partial-sexp as a mostly internal detail of
syntax-ppss nowadays.

        Stefan


* Re: syntax-propertize-function vs indentation lexer
  2013-05-31 13:23         ` Stefan Monnier
@ 2013-06-01  5:19           ` Stephen Leake
  2013-06-01 14:31             ` Stefan Monnier
  0 siblings, 1 reply; 9+ messages in thread
From: Stephen Leake @ 2013-06-01  5:19 UTC
  To: emacs-devel

Stefan Monnier <monnier@IRO.UMontreal.CA> writes:

>>> Sounds expensive.  How does it cope with large buffers?
>
>> Not clear yet - I'm still getting the Ada grammar right. 
>
>> The parser is actually generalized LALR, which spawns parallel parsers
>> for grammar conflicts and ambiguities. So it can be very slow when the
>> grammar has too many conflicts or is ambiguous - running 64 parsers in
>> parallel is a lot slower than running 1 :). But it works well when the
>> conflict can be resolved in a few tokens, and is much easier than
>> reconstructing the grammar to eliminate the conflict.
>
> Aha!  So the parser is a separate executable written in some other
> language than Elisp, I guess?  In that case parsing speed should not be
> a serious concern (except for the possible explosion of parallel
> parsers).

No, it's in elisp. See
http://stephe-leake.org/emacs/ada-mode/emacs-ada-mode.html for the code.

> Although, using it for indentation makes speed a real concern:
> in many cases one does "edit+reindent", so if you put a "full reparse"
> between the two, it needs to be about as fast as instantaneous.

That's how my current SMIE-based parser works, and it's "fast enough".

I'm working on replacing it with an LALR parser, because the resulting
code is much cleaner.

>> Such a file would never be accepted in any project I have worked on, in
>> any source language, unless it was generated from some other source.
>
> I agree that 1MB is very unusual, but emacs/src/xdisp.c is pretty damn
> close to 1MB.  And several times I've seen files of several hundred
> KB.

Ok.

>> I'm also using the cached data for navigation (moving from 'if' to
>> 'then' to 'elsif' to 'end if' etc); that is logically independent of
>> indentation (but not of the parser, of course).
>
> Then navigation should also call syntax-propertize (indeed smie's sexp
> navigation also calls syntax-propertize for the same reason).

Yes, I think this is the best solution.

>> Since the parser is asynchronous from the indentation, it would have to
>> go in the parser (actually lexer) code. wisi-forward-token would be a
>> logical place. But what would be the right guess for 'end'? The first
>> step in wisi-forward-token is forward-comment, which can skip quite large
>> portions of the buffer. 
>
> I have the same problem in SMIE navigation, indeed.  For backward
> navigation, that's not a problem, but for forward navigation, I don't
> have a good answer.  Luckily, SMIE mostly cares about backward
> navigation since that's what's needed for indentation, but currently
> forward navigation can bump into parse bugs because syntax-propertize
> was not called on the text being considered.

In my case, putting the call to syntax-propertize in wisi-parse-buffer,
not in wisi-forward-token, solves the problem; wisi-parse-buffer always
parses the whole buffer :). This could easily be generalized to take an
'end' arg.

>> How does syntax.el take care of this? The only function on
>> after-change-functions by default is jit-lock-after-change. And that's
>> only there if font-lock is on.
>
> It's added to before-change-functions.

Doh!

Thanks,

-- 
-- Stephe




* Re: syntax-propertize-function vs indentation lexer
  2013-06-01  5:19           ` Stephen Leake
@ 2013-06-01 14:31             ` Stefan Monnier
  0 siblings, 0 replies; 9+ messages in thread
From: Stefan Monnier @ 2013-06-01 14:31 UTC
  To: Stephen Leake; +Cc: emacs-devel

>> Although, using it for indentation makes speed a real concern:
>> in many cases one does "edit+reindent", so if you put a "full reparse"
>> between the two, it needs to be about as fast as instantaneous.
> That's how my current SMIE-based parser works, and it's "fast enough".

Very interesting.

> I'm working on replacing it with an LALR parser, because the resulting
> code is much cleaner.

Makes sense: if you're going to parse the whole buffer forward anyway,
the limitations of SMIE don't make much sense: they're there
specifically so you can parse backward on-the-fly instead of keeping
a full-parse around.


        Stefan




Thread overview: 9+ messages
2013-05-29 15:19 syntax-propertize-function vs indentation lexer Stephen Leake
2013-05-29 15:28 ` Dmitry Gutov
2013-05-29 17:52 ` Stefan Monnier
2013-05-30  9:15   ` Stephen Leake
2013-05-30 14:02     ` Stefan Monnier
2013-05-31  7:45       ` Stephen Leake
2013-05-31 13:23         ` Stefan Monnier
2013-06-01  5:19           ` Stephen Leake
2013-06-01 14:31             ` Stefan Monnier
