Font-lock of comments using comment tokens, does it work?

unofficial mirror of help-gnu-emacs@gnu.org

* Font-lock of comments using comment tokens, does it work?
@ 2015-06-03 16:46 Björn Lindqvist
  0 siblings, 0 replies; 11+ messages in thread
From: Björn Lindqvist @ 2015-06-03 16:46 UTC (permalink / raw)
  To: help-gnu-emacs

Hello emacs,

I have a really complicated font-locking problem I'm trying to solve
for a major mode. It's like I've tried everything but nothing
works. Here is the question I asked on Stack Overflow and got some
help with but it didn't go all the way:

http://stackoverflow.com/questions/29973458/avoid-font-locking-interfering-inside-of-comments

I want font-lock to understand that two short strings, e.g. FOO and
BAR, are the comment tokens. The tokens themselves should be
font-locked as comments and everything following them until the end of
line should also be comments.

The problem is that the strings only start comments if they are
free-standing tokens. So on these four lines there are comments:

    random code FOO random comment
    stuff BAR comment "with stuff"
    BAR FOO BAR
    FOO

On these four lines there are NO comments:

    FOObar random code
    BARFOO random code
    random code xyFOOzw
    random code "with string FOO " etc ...

Because the comment tokens are not separate. I'm suspecting that I've
found a limitation in emacs font-locking and that this is impossible
to get completely right. I'd love to be proven wrong though. :)

-- 
mvh/best regards Björn Lindqvist


* Re: Font-lock of comments using comment tokens, does it work?
       [not found] <mailman.4238.1433357678.904.help-gnu-emacs@gnu.org>
@ 2015-06-03 19:11 ` Stefan Monnier
  2015-06-03 23:37   ` Björn Lindqvist
       [not found]   ` <mailman.4244.1433374628.904.help-gnu-emacs@gnu.org>
  0 siblings, 2 replies; 11+ messages in thread
From: Stefan Monnier @ 2015-06-03 19:11 UTC (permalink / raw)
  To: help-gnu-emacs

> I have a really complicated font-locking problem I'm trying to solve
> for a major mode. It's like I've tried everything but nothing
> works. Here is the question I asked on Stack Overflow and got some
> help with but it didn't go all the way:

The answer there gives you the technique to use.  AFAICT you only need
to adjust the regexp he used in exmark-syntax-propertize.

You'll want to read about syntax-tables in the Elisp manual to
understand what the "_" means in that function (for a quick
refresher on which char means what, C-h f modify-syntax-entry RET is
what I use, but you first need to read the Elisp reference to really
understand how that works).

        Stefan


* Re: Font-lock of comments using comment tokens, does it work?
  2015-06-03 19:11 ` Font-lock of comments using comment tokens, does it work? Stefan Monnier
@ 2015-06-03 23:37   ` Björn Lindqvist
       [not found]   ` <mailman.4244.1433374628.904.help-gnu-emacs@gnu.org>
  1 sibling, 0 replies; 11+ messages in thread
From: Björn Lindqvist @ 2015-06-03 23:37 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: help-gnu-emacs

2015-06-03 21:11 GMT+02:00 Stefan Monnier <monnier@iro.umontreal.ca>:
>> I have a really complicated font-locking problem I'm trying to solve
>> for a major mode. It's like I've tried everything but nothing
>> works. Here is the question I asked on Stack Overflow and got some
>> help with but it didn't go all the way:
>
> The answer there gives you the technique to use.  AFAICT you only need
> to adjust the regexp he used in exmark-syntax-propertize.
>
> You'll want to read about syntax-tables in the Elisp manual to
> understand what the "_" means in that function (for a quick
> refresher on which char means what, C-h f modify-syntax-entry RET is
> what I use, but you first need to read the Elisp reference to really
> understand how that works).

If you think you know what it should be changed to, can you tell me?
I've tried a dozen different permutations of the regexp and none of
them produces the desired result. I've also read the syntactic
font-lock and syntax table sections of the manual several times and I
still don't get it. (sorry for the double mail)


-- 
mvh/best regards Björn Lindqvist




* Re: Font-lock of comments using comment tokens, does it work?
       [not found]   ` <mailman.4244.1433374628.904.help-gnu-emacs@gnu.org>
@ 2015-06-04  3:42     ` Stefan Monnier
  2015-06-04 11:10       ` Björn Lindqvist
       [not found]       ` <mailman.4276.1433416248.904.help-gnu-emacs@gnu.org>
  0 siblings, 2 replies; 11+ messages in thread
From: Stefan Monnier @ 2015-06-04  3:42 UTC (permalink / raw)
  To: help-gnu-emacs

> If you think you know what it should be changed to, can you tell me?

I don't know enough of the context to be sure.  Also, as Emacs
maintainer I have enough experience/knowledge to fix most users'
problems, but if I do that I'll just end up with more users with new
problems to fix.  So instead I'm better off trying to train them so they
can fix their problems themselves and even help me improve Emacs.

> I've tried a dozen different permutations of the regexp and none of
> them produces the desired result.

What have you tried?  What/where were the undesired results?

> I've also read the syntactic font-lock and syntax table sections of
> the manual several times and I still don't get it.

So you've covered the basics, good.
The thing you need to understand is that it all boils down to the
"syntax" given to the "!" character.  The default is set in the
buffer-local syntax-table, and this default is adjusted by
`syntax-table' text-properties which are applied via syntax-propertize.
So you can always go to a "!" and then hit C-u C-x = to see what is the
syntax of *this* particular "!" character, and whether that is the
desired syntax.  If it's not, then you can try
M-: (re-search-forward "theregexp" nil t) to see if the pattern you used
does match or doesn't match this char.
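
The same check can also be scripted. A minimal sketch, assuming a
helper name (`my-syntax-class-at') that is not part of Emacs; it relies
only on `syntax-propertize', `syntax-after', `syntax-class' and
`syntax-class-to-char':

```elisp
;; Sketch only: `my-syntax-class-at' is a hypothetical helper.
(defun my-syntax-class-at (pos)
  "Return the syntax class character in effect for the char at POS.
Runs `syntax-propertize' first so that `syntax-table' text
properties set by the mode are in place before looking."
  (syntax-propertize (min (1+ pos) (point-max)))
  (let ((syntax (syntax-after pos)))
    ;; `syntax-after' returns nil when POS is outside the buffer.
    (and syntax (syntax-class-to-char (syntax-class syntax)))))
```

With point on a "!", M-: (my-syntax-class-at (point)) should give ?<
where that "!" counts as a comment starter and ?_ where it has been
turned into a symbol constituent.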

        Stefan


* Re: Font-lock of comments using comment tokens, does it work?
  2015-06-04  3:42     ` Stefan Monnier
@ 2015-06-04 11:10       ` Björn Lindqvist
       [not found]       ` <mailman.4276.1433416248.904.help-gnu-emacs@gnu.org>
  1 sibling, 0 replies; 11+ messages in thread
From: Björn Lindqvist @ 2015-06-04 11:10 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: help-gnu-emacs

2015-06-04 5:42 GMT+02:00 Stefan Monnier <monnier@iro.umontreal.ca>:
>> If you think you know what it should be changed to, can you tell me?
>
> I don't know enough of the context to be sure.

What extra context can I provide you with? If there is something in my
problem description that is unclear I can try to explain it more
precisely.

> Also, as Emacs
> maintainer I have enough experience/knowledge to fix most users'
> problems, but if I do that I'll just end up with more users with new
> problems to fix.  So instead I'm better off trying to train them so they
> can fix their problems themselves and even help me improve Emacs.
>
>> I've tried a dozen different permutations of the regexp and none of
>> them produces the desired result.
>
> What have you tried?  What/where were the undesired results?

("[a-zA-Z0-9_]\\(! \\) " (1 "_")))
("\\(!\\)[a-zA-Z0-9_]" (1 "_")))
("\\(! \\)[a-zA-Z0-9_]" (1 "_")))
("[a-zA-Z0-9_]\\(!\\) " (1 "_ ")))
("[a-zA-Z0-9_]\\(!\\) " (1 " _ ")))

And so on.  The undesired results were incorrect font-locking of
comments and regular tokens being falsely identified as comment
tokens.


-- 
mvh/best regards Björn Lindqvist




* Re: Font-lock of comments using comment tokens, does it work?
       [not found]       ` <mailman.4276.1433416248.904.help-gnu-emacs@gnu.org>
@ 2015-06-04 22:11         ` Stefan Monnier
  2015-06-05  3:29           ` Björn Lindqvist
  0 siblings, 1 reply; 11+ messages in thread
From: Stefan Monnier @ 2015-06-04 22:11 UTC (permalink / raw)
  To: help-gnu-emacs

>> Also, as Emacs maintainer I have enough experience/knowledge to fix
>> most users' problems, but if I do that I'll just end up with more
>> users with new problems to fix.  So instead I'm better off trying to
>> train them so they can fix their problems themselves and even help me
>> improve Emacs.
>>> I've tried a dozen different permutations of the regexp and none of
>>> them produces the desired result.
>> What have you tried?  What/where were the undesired results?

> ("[a-zA-Z0-9_]\\(! \\) " (1 "_")))

IIUC you want all "!" that are surrounded by spaces to be treated as
comment starters.  And you've marked "!" as a comment starter by default
(i.e. in the mode's syntax-table), so you need to mark all "!" which are
not surrounded by spaces as being not-comment-starters.

The above regexp does part of the work, but only does it for those "!"
which are preceded by a latin letter or a number and are followed by
a space.  E.g. it will fail on those "!" which don't have a space afterwards.

> ("\\(!\\)[a-zA-Z0-9_]" (1 "_")))

This one will fail on those "!" which are followed by a character that's
neither a space nor a latin letter nor a number.  And it will fail on
those "!" which are followed by a space but are not preceded by a space.

To me, the translation into regexp of «all "!" which are not surrounded
by spaces» would look like "[^ ]![^ ]".  Have you tried something like
that?  Of course, it'll still probably require more tweaking because
I suspect that «all "!" which are not surrounded by spaces» is not
actually a precise description of all cases that matter.  E.g. I suspect
that if the "!" is preceded by a newline (i.e. is at the beginning of
a line) it should still be considered a comment starter.  Same thing if
it's preceded by a TAB.  Also it's likely that " !! " would also start
a comment, so "followed by a space" is too strict as well.  But then,
I don't know if " !!a" would be treated as starting a comment.
IOW, maybe you'll want something like
"[^ \n\t]\\(!+\\)[^ \t\n]" instead.

One more thing: if "! as a normal char" is more common than "! as
a comment starter", it might be worthwhile to take the opposite approach
and define the syntax of "!" in the mode's syntax-table as being "_" and
then in syntax-propertize-function mark those "!" which start a comment
as having syntax "<".
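
A minimal sketch of that opposite approach, assuming that a "!" counts
as free-standing when it is preceded by beginning-of-line, space or TAB
and followed by end-of-line, space or TAB. The mode name `bang-mode'
and its helpers are hypothetical, chosen only for illustration:

```elisp
;; Sketch only: `bang-mode' and its helper names are hypothetical.
(defvar bang-mode-syntax-table
  (let ((table (make-syntax-table)))
    (modify-syntax-entry ?! "_" table)   ; "!" is a symbol char by default
    (modify-syntax-entry ?\n ">" table)  ; newline ends a comment
    table))

(defun bang-mode-syntax-propertize (start end)
  "Mark each free-standing \"!\" between START and END as a comment starter."
  (funcall (syntax-propertize-rules
            ;; Group 1 is the "!" itself; the context groups are shy.
            ("\\(?:^\\|[ \t]\\)\\(!\\)\\(?:$\\|[ \t]\\)" (1 "<")))
           start end))

(define-derived-mode bang-mode prog-mode "Bang"
  "Toy mode where only a free-standing \"!\" starts a comment."
  (setq-local syntax-propertize-function #'bang-mode-syntax-propertize))
```

Here the rule only ever adds comment starters, so a "!" glued to other
characters never needs to be "unmarked".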

Yet another thing: if you have trouble catching all cases with a single
regexp, you can use more rules, as in

   (syntax-propertize-rules
    ("[a-zA-Z0-9_]\\(! \\) " (1 "_"))
    ("\\(!\\)[a-zA-Z0-9_]" (1 "_")))

        Stefan


* Re: Font-lock of comments using comment tokens, does it work?
  2015-06-04 22:11         ` Stefan Monnier
@ 2015-06-05  3:29           ` Björn Lindqvist
  2015-06-05  6:53             ` tomas
  0 siblings, 1 reply; 11+ messages in thread
From: Björn Lindqvist @ 2015-06-05  3:29 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: help-gnu-emacs

2015-06-05 0:11 GMT+02:00 Stefan Monnier <monnier@iro.umontreal.ca>:
>>> Also, as Emacs maintainer I have enough experience/knowledge to fix
>>> most users' problems, but if I do that I'll just end up with more
>>> users with new problems to fix.  So instead I'm better off trying to
>>> train them so they can fix their problems themselves and even help me
>>> improve Emacs.
>>>> I've tried a dozen different permutations of the regexp and none of
>>>> them produces the desired result.
>>> What have you tried?  What/where were the undesired results?
>
>> ("[a-zA-Z0-9_]\\(! \\) " (1 "_")))
>
> IIUC you want all "!" that are surrounded by spaces to be treated as
> comment starters.

No. I want two strings, FOO and BAR (or ! doesn't matter, same
principle) to start comments iff they are separate tokens. Look at my
examples if the definition isn't so precise. FOO written at the top of
the buffer and followed by a newline would therefore start a comment.

> The above regexp does part of the work, but only does it for those "!"
> which are preceded by a latin letter or a number and are followed by
> a space.  E.g. it will fail on those "!" which don't have a space afterwards.
>
>> ("\\(!\\)[a-zA-Z0-9_]" (1 "_")))
>
> This one will fail on those "!" which are followed by a character that's
> neither a space nor a latin letter nor a number.  And it will fail on
> those "!" which are followed by a space but are not preceded by a space.
>
> To me, the translation into regexp of «all "!" which are not surrounded
> by spaces» would look like "[^ ]![^ ]".  Have you tried something like
> that?

That turns the comment face off if the ! is in the middle, but not if
it prefixes or suffixes the token. abcFOO is wrongly interpreted as a
comment starter.

> Also it's likely that " !! " would also start
> a comment, so "followed by a space" is too strict as well.  But then,
> I don't know if " !!a" would be treated as starting a comment.
> IOW, maybe you'll want something like
> "[^ \n\t]\\(!+\\)[^ \t\n]" instead.

No. In "!!" and "!!a" the comment token is not separate, so no comment.

> Yet another thing: if you have trouble catching all cases with a single
> regexp, you can use more rules, as in
>
>    (syntax-propertize-rules
>     ("[a-zA-Z0-9_]\\(! \\) " (1 "_"))
>     ("\\(!\\)[a-zA-Z0-9_]" (1 "_")))

It still messes up the comment font-locking. BTW I've noticed that if
the regexp is "test\\(!\\)" emacs correctly does not use comment face
on "test!". But if it is "\\(!\\)test" then "!test" is still seen as a
comment. That is inconsistent with what you have explained and the
elisp manual. So I think it is a bug.


-- 
mvh/best regards Björn Lindqvist




* Re: Font-lock of comments using comment tokens, does it work?
  2015-06-05  3:29           ` Björn Lindqvist
@ 2015-06-05  6:53             ` tomas
  2015-06-05 19:37               ` Björn Lindqvist
  0 siblings, 1 reply; 11+ messages in thread
From: tomas @ 2015-06-05  6:53 UTC (permalink / raw)
  To: Björn Lindqvist; +Cc: help-gnu-emacs, Stefan Monnier

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Fri, Jun 05, 2015 at 05:29:36AM +0200, Björn Lindqvist wrote:
> 2015-06-05 0:11 GMT+02:00 Stefan Monnier <monnier@iro.umontreal.ca>:

[...]

> No. I want two strings, FOO and BAR (or ! doesn't matter, same
> principle) to start comments iff they are separate tokens.

Sorry to jump in the middle. I've been lurking in case I could help
(and to learn about font-lock).

Björn: you are assuming that everyone knows what's a "token" to you.
And you are assuming that everyone has the time to read and grasp
all your examples, first time. I for one don't know what your tokens
are. To put one extreme example, to C, the string 'a+b' is three
tokens, '++a' is two; for Lisp, the first and the second example are
both just *one* token.

Given that, you can't expect Stefan to even come near a regular expression
useful to you, since what those regular expressions do is exactly
*separate tokens*.

> > To me, the translation into regexp of «all "!" which are not surrounded
> > by spaces» would look like "[^ ]![^ ]".  Have you tried something like
> > that?
> 
> That turns the comment face off if the ! is in the middle, but not if
> it prefixes or suffixes the token. abcFOO is wrongly interpreted as a
> comment starter.

You mean when FOO is at the end of the line? Then no character would be
there and the second '[^ ]' wouldn't match?

That's what Stefan said, you'll have to tweak this. Use an alternative '\|',
something like "\(^\|[^ ]\)!\([^ ]\|$\)" (i.e. match at
beginning-of-line-or-space, then "!", then space-or-end-of-line). You can
use "\(?: ... \)" if you want non-capturing groups. Watch out for those
backslashes: you want to double them when writing them as an Elisp string [1].

> > Also it's likely that " !! " would also start
> > a comment [...]

> No. In "!!" and "!!a" the comment token is not separate, so no comment.

See? we are all guessing at what your tokens are. Designing the
nitty-gritties of your regexps and testing them can only be your
work, because we'd be all fighting phantoms.

> > Yet another thing: if you have trouble catching all cases with a single
> > regexp, you can use more rules, as in
> >
> >    (syntax-propertize-rules
> >     ("[a-zA-Z0-9_]\\(! \\) " (1 "_"))
> >     ("\\(!\\)[a-zA-Z0-9_]" (1 "_")))
> 
> It still messes up the comment font-locking. BTW I've noticed that if
> the regexp is "test\\(!\\)" emacs correctly does not use comment face
> on "test!". But if it is "\\(!\\)test" then "!test" is still seen as a
> comment. That is inconsistent with what you have explained and the
> elisp manual. So I think it is a bug.

I don't understand you. What is this "test", where does it come from
and what is it doing *in* the regular expression? (And in which one:
in syntax-propertize rules, as in Stefan's example above, or somewhere
else?)

Regards
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEARECAAYFAlVxR0wACgkQBcgs9XrR2kZXwwCdGe35a21eiwMqPmJ/xYmnjd5H
RK8AniLaRZ3iKcGU1ah3uAPeJNpERABf
=vTni
-----END PGP SIGNATURE-----


* Re: Font-lock of comments using comment tokens, does it work?
  2015-06-05  6:53             ` tomas
@ 2015-06-05 19:37               ` Björn Lindqvist
  2015-06-07  3:58                 ` Björn Lindqvist
  0 siblings, 1 reply; 11+ messages in thread
From: Björn Lindqvist @ 2015-06-05 19:37 UTC (permalink / raw)
  To: tomas; +Cc: help-gnu-emacs, Stefan Monnier

>> No. I want two strings, FOO and BAR (or ! doesn't matter, same
>> principle) to start comments iff they are separate tokens.
>
> Sorry to jump in the middle. I've been lurking in case I could help
> (and to learn about font-lock).
>
> Björn: you are assuming that everyone knows what's a "token" to you.
> And you are assuming that everyone has the time to read and grasp
> all your examples, first time. I for one don't know what your tokens
> are. To put one extreme example, to C, the string 'a+b' is three
> tokens, '++a' is two; for Lisp, the first and the second example are
> both just *one* token.

It's not necessary to know what the tokenization rules for my language
are to help me. Though the rules are very simple, each
whitespace-separated sequence of characters is one token. But if you
can just come up with the required syntax-propertize-rules and syntax
table setup to make the first four lines of my example highlight as
comments and the last four *not* highlight as comments, I would be
happy with that:

    random code ,FOO random comment
    stuff ,BAR comment "with stuff"
    ,BAR FOO BAR
    ,FOO

    FOObar random code
    BARFOO random code
    random code xyFOOzw
    random code "with string FOO " etc ...

Here I've added a comma to show where the start of the comment-face
should be. The comma is not present in the output. You can even take
this skeleton mode I wrote and just figure out what to write inside
the regexp:

(defun mm-syntax-propertize (start end)
  (funcall (syntax-propertize-rules ("WHAT HERE?" (1 "_"))) start end))

(defvar mm-mode-syntax-table
  (let ((table (make-syntax-table prog-mode-syntax-table)))
    (modify-syntax-entry ?\n  ">   " table)
    (modify-syntax-entry ?! "< " table)
    table))

(define-derived-mode mm-mode prog-mode "Foo"
  (setq-local font-lock-defaults '(()))
  (setq-local syntax-propertize-function 'mm-syntax-propertize))

Here "!" is the comment token, but I'm also trying to get it to work for
N arbitrary comment tokens. Also I've worked on this problem for a
long time (check the date on my stackoverflow posting) and tried
dozens of different approaches. So spitballing guesses (I've tried all
guesses so far and many permutations of them and none have worked) or
telling me to rtfm again is pointless.

-- 
mvh/best regards Björn Lindqvist


* Re: Font-lock of comments using comment tokens, does it work?
  2015-06-05 19:37               ` Björn Lindqvist
@ 2015-06-07  3:58                 ` Björn Lindqvist
  2015-06-07 11:34                   ` tomas
  0 siblings, 1 reply; 11+ messages in thread
From: Björn Lindqvist @ 2015-06-07  3:58 UTC (permalink / raw)
  To: tomas; +Cc: help-gnu-emacs, Stefan Monnier

Ok so I finally almost figured it out. The key part was that you must
invert the logic so that instead of "unmarking" in
syntax-propertize-rules, you use a regexp that adds the comment
starter property "<" to the matched strings. Something is bugged with
the unmarking approach, it's like it stops looking when it found the
comment character. Anyway:

    (syntax-propertize-rules
       ("\\(^\\| \\|\t\\)\\(FOO\\|BAR\\)\\($\\| \\|\t\\)" (2 "<   ")))

Does almost exactly what I want. Amusingly enough the matching is
case-INsensitive so bAR, Bar, foo, etc matches the regexp. But it's a
small flaw which I can live with.

-- 
mvh/best regards Björn Lindqvist


* Re: Font-lock of comments using comment tokens, does it work?
  2015-06-07  3:58                 ` Björn Lindqvist
@ 2015-06-07 11:34                   ` tomas
  0 siblings, 0 replies; 11+ messages in thread
From: tomas @ 2015-06-07 11:34 UTC (permalink / raw)
  To: Björn Lindqvist; +Cc: help-gnu-emacs, Stefan Monnier

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Sun, Jun 07, 2015 at 05:58:45AM +0200, Björn Lindqvist wrote:
> Ok so I finally almost figured it out. The key part was that you must
> invert the logic so that instead of "unmarking" in
> syntax-propertize-rules, you use a regexp that adds the comment
> starter property "<" to the matched strings. Something is bugged with
> the unmarking approach, it's like it stops looking when it found the
> comment character. Anyway:
> 
>     (syntax-propertize-rules
>        ("\\(^\\| \\|\t\\)\\(FOO\\|BAR\\)\\($\\| \\|\t\\)" (2 "<   ")))

Got it.

> Does almost exactly what I want. Amusingly enough the matching is
> case-INsensitive so bAR, Bar, foo, etc matches the regexp. But it's a
> small flaw which I can live with.

This is most probably related to the (buffer local) variable case-fold-search,
which controls whether the regexp search functions are case sensitive or not.
By default, it's set to t (that's what you usually expect interactively).

You could try to set this variable to nil and see whether it works better.

Alas, there doesn't seem to be a way to control that in the simple
syntax-propertize-rules -- possibly you'll have to go the way of
defining a syntax-propertize-function, where you can set this variable
dynamically for the relevant call.
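
A minimal sketch of that idea, assuming the FOO/BAR rule quoted above
and a hypothetical function name; the only new ingredient is the `let'
around the call, which makes the search case-sensitive while the rules
run:

```elisp
;; Sketch only: `my-token-comment-propertize' is a hypothetical name.
(defun my-token-comment-propertize (start end)
  "Mark free-standing FOO or BAR between START and END as comment starters."
  (let ((case-fold-search nil))   ; match FOO and BAR exactly, not foo or Bar
    (funcall (syntax-propertize-rules
              ("\\(?:^\\|[ \t]\\)\\(FOO\\|BAR\\)\\(?:$\\|[ \t]\\)" (1 "<")))
             start end)))

;; In the mode body:
;; (setq-local syntax-propertize-function #'my-token-comment-propertize)
```

The shy groups "\\(?: ... \\)" keep the token itself in group 1, so
only FOO or BAR receives the "<" syntax, not the surrounding whitespace.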

Ducking around, here's a little snippet which might help to get you
started:

  <http://www.lunaryorn.com/feed.atom#hooking-into-syntactic-analyses>

HTH, regards
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEARECAAYFAlV0LFgACgkQBcgs9XrR2kaEvwCcCR5xyTagZ/ihFBj7iv9keKTl
pVAAnjRjsvahbfSW4+1sJ0UEconY1lz2
=5Ywn
-----END PGP SIGNATURE-----
