Re: regexp does not work as documented

unofficial mirror of emacs-devel@gnu.org 

* Re: regexp does not work as documented
@ 2008-05-06  4:20 Chong Yidong
  2008-05-06 11:35 ` Bruno Haible
  2008-05-06 15:00 ` David Koppelman
  0 siblings, 2 replies; 30+ messages in thread
From: Chong Yidong @ 2008-05-06  4:20 UTC (permalink / raw)
  To: emacs-devel; +Cc: koppel, 192, Bruno Haible

> $ emacs fr.po
> M-x highlight-regexp
>
> Regexp to highlight (enter this with Ctrl-Q Ctrl-J for each of the two
> newlines):
>
> ^msgstr\(\[[0-9]\]\)?.*
> \(".*
> \)*
>
> Highlight using face: hi-yellow
>
> Then move around in the buffer and look which lines are highlighted.
> In the first match already, only 5 out of 11 lines are highlighted.

I believe this bug arises because highlight-regexp uses font-lock to
highlight the regular expression, and the font-lock engine is
intentionally limiting the region to search for the multi-line regular
expression.

OTOH, I don't see what we can do about this problem.  Maybe we could add
a note to the docstring of highlight-regexp saying that multi-line
regular expressions are problematic?  Does anyone have a suggestion?

BTW, here is a simplified recipe, for those who didn't download the
attached file:

1. Copy the following text, between the "---...----" lines, into a
   buffer

------------------
# Messages français pour GNU gettext.
# Copyright © 2006 Free Software Foundation, Inc.
# François Pinard <pinard@iro.umontreal.ca>, 1996.
#
#
msgid ""
msgstr ""
"Project-Id-Version: GNU gettext-tools 0.16.2-pre5\n"
"Report-Msgid-Bugs-To: bug-gnu-gettext@gnu.org\n"
"POT-Creation-Date: 2007-11-02 03:23+0100\n"
"PO-Revision-Date: 2007-10-27 13:35+0200\n"
"Last-Translator: Christophe Combelles <ccomb@free.fr>\n"
"Language-Team: French <traduc@traduc.org>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=2; plural=(n > 1);\n"
------------------

2. M-: (highlight-regexp "^m.*\n\\(\".*\n\\)+") RET

Note that the last two lines remain unhighlighted.
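
A minimal sketch for checking the regexp outside of font-lock, assuming
a hypothetical helper `my-first-match-line-count' (not part of hi-lock
or Emacs).  It runs plain re-search-forward in a temporary buffer, so
the result only says something about the regexp engine, not about what
highlight-regexp displays:

```elisp
;; Hypothetical helper (an assumption for illustration, not hi-lock code):
;; report how many lines the first match of REGEXP covers in TEXT.
(defun my-first-match-line-count (regexp text)
  "Return the number of lines spanned by the first match of REGEXP in TEXT.
Return nil if REGEXP does not match at all."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (when (re-search-forward regexp nil t)
      (count-lines (match-beginning 0) (match-end 0)))))

;; Usage with a shortened version of the recipe text above.
(my-first-match-line-count
 "^m.*\n\\(\".*\n\\)+"
 (concat "msgid \"\"\n"
         "msgstr \"\"\n"
         "\"MIME-Version: 1.0\\n\"\n"
         "\"Content-Transfer-Encoding: 8bit\\n\"\n"
         "\"Plural-Forms: nplurals=2; plural=(n > 1);\\n\"\n"))
```

If this reports every line from msgstr to the end of the text, the
regexp itself is fine and the missing highlighting comes from elsewhere.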


* Re: regexp does not work as documented
  2008-05-06  4:20 regexp does not work as documented Chong Yidong
@ 2008-05-06 11:35 ` Bruno Haible
  2008-05-06 12:12   ` martin rudalics
  2008-05-06 15:35   ` Stefan Monnier
  2008-05-06 15:00 ` David Koppelman
  1 sibling, 2 replies; 30+ messages in thread
From: Bruno Haible @ 2008-05-06 11:35 UTC (permalink / raw)
  To: Chong Yidong, emacs-devel; +Cc: koppel, 192

Chong Yidong wrote:
> BTW, here is a simplified recipe, for those who didn't download the
> attached file:
> 
> 1. Copy the following text, between the "---...----" lines, into a
>    buffer
> 
> ------------------
> # Messages français pour GNU gettext.
> # Copyright © 2006 Free Software Foundation, Inc.
> # François Pinard <pinard@iro.umontreal.ca>, 1996.
> #
> #
> msgid ""
> msgstr ""
> "Project-Id-Version: GNU gettext-tools 0.16.2-pre5\n"
> "Report-Msgid-Bugs-To: bug-gnu-gettext@gnu.org\n"
> "POT-Creation-Date: 2007-11-02 03:23+0100\n"
> "PO-Revision-Date: 2007-10-27 13:35+0200\n"
> "Last-Translator: Christophe Combelles <ccomb@free.fr>\n"
> "Language-Team: French <traduc@traduc.org>\n"
> "MIME-Version: 1.0\n"
> "Content-Type: text/plain; charset=UTF-8\n"
> "Content-Transfer-Encoding: 8bit\n"
> "Plural-Forms: nplurals=2; plural=(n > 1);\n"
> ------------------
> 
> 2. M-: (highlight-regexp "^m.*\n\\(\".*\n\\)+") RET
> 
> Note that the last two lines remain unhighlighted.

Yes. I reproduce with this simpler recipe as well. Thank you.

> I believe this bug arises because highlight-regexp uses font-lock to
> highlight the regular expression, and the font-lock engine is
> intentionally limiting the region to search for the multi-line regular
> expression.

You are right that there is a limit, but it is set to 200000:
highlight-regexp is aliased to hi-lock-face-buffer, which asks for the
arguments and calls hi-lock-set-pattern. hi-lock-set-pattern does little
more than applying a margin of 100000 and calling re-search-forward.

I believe the origin of the bug is deeper, because
  - the limit of 100000 is way larger than the little snippet you posted,
  - I originally observed the bug in po-mode (part of GNU gettext), in
    a function po-find-span-of-entry which essentially only calls
    re-search-backward and re-search-forward.

> OTOH, I don't see what we can do about this problem.  Maybe we could add
> a note to the docstring of highlight-regexp saying that multi-line
> regular expressions are problematic?

Can someone help me find a workaround, then? If not, I would have to give up
maintaining po-mode as part of GNU gettext. Said function is central in
Emacs po-mode (everything else relies on it), and if multi-line regular
expressions don't work, I don't know how this function could be rewritten.

Bruno






* Re: regexp does not work as documented
  2008-05-06 11:35 ` Bruno Haible
@ 2008-05-06 12:12   ` martin rudalics
  2008-05-10 19:18     ` David Koppelman
  2008-05-06 15:35   ` Stefan Monnier
  1 sibling, 1 reply; 30+ messages in thread
From: martin rudalics @ 2008-05-06 12:12 UTC (permalink / raw)
  To: Bruno Haible; +Cc: Chong Yidong, 192, koppel, emacs-devel

 > Can someone help me find a workaround, then? If not, I would have to give up
 > maintaining po-mode as part of GNU gettext. Said function is central in
 > Emacs po-mode (everything else relies on it), and if multi-line regular
 > expressions don't work, I don't know how this function could be rewritten.

Don't worry, Stefan will find the solution.  First of all you will
probably have to

(setq font-lock-multiline t)

in the respective buffer.  This will _not_ always DTRT after a buffer
modification, as, for example, in

AAAA

CCCC

BBBB

where AAAA stands for some old text previously matched by your regexp,
CCCC for some new text inserted (or old text removed), and BBBB for some
text which, after the change, is now matched by the regexp (or not
matched any more): In this case BBBB will be wrongly highlighted now.
Alan uses the notorious

`font-lock-extend-jit-lock-region-after-change'

function to handle this, but it's not immediately clear how to apply
this here.  If everything else fails you will have to refontify till
`window-end' (I prefer using a timer for such refontifications).


* Re: regexp does not work as documented
  2008-05-06  4:20 regexp does not work as documented Chong Yidong
  2008-05-06 11:35 ` Bruno Haible
@ 2008-05-06 15:00 ` David Koppelman
  2008-05-06 21:35   ` Bruno Haible
  1 sibling, 1 reply; 30+ messages in thread
From: David Koppelman @ 2008-05-06 15:00 UTC (permalink / raw)
  To: Chong Yidong; +Cc: 192, Bruno Haible, emacs-devel

Later in the week I'll look into it and provide either a fix
or document the limitation.

Chong Yidong <cyd@stupidchicken.com> writes:

> OTOH, I don't see what we can do about this problem.  Maybe we could add
> a note to the docstring of highlight-regexp saying that multi-line
> regular expressions are problematic?  Does anyone have a suggestion?




* Re: regexp does not work as documented
  2008-05-06 11:35 ` Bruno Haible
  2008-05-06 12:12   ` martin rudalics
@ 2008-05-06 15:35   ` Stefan Monnier
  2008-05-06 21:29     ` Bruno Haible
  2008-05-10 20:04     ` Bruno Haible
  1 sibling, 2 replies; 30+ messages in thread
From: Stefan Monnier @ 2008-05-06 15:35 UTC (permalink / raw)
  To: Bruno Haible; +Cc: Chong Yidong, 192, koppel, emacs-devel

> You are right that there is a limit, but it is set to 200000:
> highlight-regexp is aliased to hi-lock-face-buffer, which asks for the
> arguments and calls hi-lock-set-pattern.  hi-lock-set-pattern does little
> more than applying a margin of 100000 and calling re-search-forward.

Actually, font-lock-fontified is most likely set to t, so
hi-lock-set-pattern doesn't call re-search-forward at all and only calls
font-lock-fontify-buffer instead.

>   - I originally observed the bug in po-mode (part of GNU gettext), in
>     a function po-find-span-of-entry which essentially only calls
>     re-search-backward and re-search-forward.

Please try and reproduce the problem there and send us a recipe.


        Stefan





* Re: regexp does not work as documented
  2008-05-06 15:35   ` Stefan Monnier
@ 2008-05-06 21:29     ` Bruno Haible
  2008-05-10 20:04     ` Bruno Haible
  1 sibling, 0 replies; 30+ messages in thread
From: Bruno Haible @ 2008-05-06 21:29 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Chong Yidong, 192, koppel, emacs-devel

Stefan Monnier wrote:
> Actually, font-lock-fontified is most likely set to t, so
> hi-lock-set-pattern doesn't call re-search-forward at all and only calls
> font-lock-fontify-buffer instead.

You're right: If I do a
  M-x eval-expression  (setq font-lock-fontified nil)
before
  M-x highlight-regexp
the result is correct. So the problem is indeed with the font-locking.

Bruno






* Re: regexp does not work as documented
  2008-05-06 15:00 ` David Koppelman
@ 2008-05-06 21:35   ` Bruno Haible
  2008-05-07  1:04     ` Stefan Monnier
  2008-05-07  1:08     ` Auto-discovery of multi-line font-lock regexps Stefan Monnier
  0 siblings, 2 replies; 30+ messages in thread
From: Bruno Haible @ 2008-05-06 21:35 UTC (permalink / raw)
  To: David Koppelman, Chong Yidong; +Cc: 192, emacs-devel

David Koppelman wrote:
> Later in the week I'll look into it and provide either a fix
> or document the limitation.

Thank you!

Chong Yidong wrote:
> > OTOH, I don't see what we can do about this problem.  Maybe we could add
> > a note to the docstring of highlight-regexp saying that multi-line
> > regular expressions are problematic?  Does anyone have a suggestion?

As an end user, for testing the effect of a regexp on a buffer interactively,
I would prefer to have a "volatile" coloring (i.e. one that disappears at the
next buffer modification) but is correct, rather than a documented-to-be-wrong
coloring that updates itself correctly during buffer modifications. Less
functionality but implemented correctly.

OTOH, third-party packages may prefer the current behaviour if their regexps
match only portions of a line.

Bruno


* Re: regexp does not work as documented
  2008-05-06 21:35   ` Bruno Haible
@ 2008-05-07  1:04     ` Stefan Monnier
  2008-05-07  1:08     ` Auto-discovery of multi-line font-lock regexps Stefan Monnier
  1 sibling, 0 replies; 30+ messages in thread
From: Stefan Monnier @ 2008-05-07  1:04 UTC (permalink / raw)
  To: Bruno Haible; +Cc: David Koppelman, 192, Chong Yidong, emacs-devel

> As an end user, for testing the effect of a regexp on a buffer interactively,
> I would prefer to have a "volatile" coloring (i.e. one that disappears at the
> next buffer modification) but is correct, rather than a documented-to-be-wrong
> coloring that updates itself correctly during buffer modifications. Less
> functionality but implemented correctly.

Actually, we can get the combination of the two.  highlight-regexp
(c|sh)ould use its own loop with re-search-forward, even when font-lock
is enabled.  This way the highlighting would be initially correct, and
in some cases it would also be correctly preserved/discovered later on.

This would only be used for regexps that can span multiple lines, so
highlight-regexp (c|sh)ould analyse the regexp to see if there's
a possibility for it to match multiple lines.
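
A rough sketch of such a loop, assuming overlays are used for the
highlighting; `my-highlight-regexp-with-overlays' and the `my-highlight'
overlay property are hypothetical names, not hi-lock's actual code:

```elisp
;; Sketch only: highlight matches with overlays instead of font-lock,
;; so the initial result does not depend on how font-lock chunks the buffer.
(defun my-highlight-regexp-with-overlays (regexp face)
  "Put an overlay with FACE on every match of REGEXP in the current buffer.
Return the number of non-empty matches that were highlighted."
  (let ((count 0))
    (save-excursion
      (goto-char (point-min))
      (while (and (not (eobp))
                  (re-search-forward regexp nil t))
        (if (= (match-beginning 0) (match-end 0))
            ;; Empty match: step forward so the loop cannot get stuck.
            (unless (eobp) (forward-char 1))
          (let ((ov (make-overlay (match-beginning 0) (match-end 0))))
            (overlay-put ov 'face face)
            ;; Hypothetical tag so these overlays can be found and removed.
            (overlay-put ov 'my-highlight t)
            (setq count (1+ count))))))
    count))

;; Usage, in the buffer from the recipe:
;; (my-highlight-regexp-with-overlays "^m.*\n\\(\".*\n\\)+" 'highlight)
```

Overlays set this way are not updated when the buffer changes, which is
the "volatile but correct" trade-off discussed above.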

        Stefan


* Auto-discovery of multi-line font-lock regexps
  2008-05-06 21:35   ` Bruno Haible
  2008-05-07  1:04     ` Stefan Monnier
@ 2008-05-07  1:08     ` Stefan Monnier
  2008-05-07  3:46       ` Chong Yidong
  1 sibling, 1 reply; 30+ messages in thread
From: Stefan Monnier @ 2008-05-07  1:08 UTC (permalink / raw)
  To: emacs-devel

While reading the "regexp does not work as documented" thread, an idea
came to me: we could have an idle task that takes the font-lock regexps
and instead of applying them directly, only uses them to find matches
that span multiple lines and mark them with the
`font-lock-multiline' property.

The idea is that the font-lock-multiline property works OK to preserve multiline
matches, so the real difficulty is in making sure we discover
them correctly.  Sometimes we do by happenstance, sometimes we do
because the major-mode was careful to make it work (which is far from
trivial), but often we just don't.  Having such a background loop would
be very helpful.  Its job can easily be stopped at any time, so it
shouldn't introduce long latencies.

        Stefan


* Re: Auto-discovery of multi-line font-lock regexps
  2008-05-07  1:08     ` Auto-discovery of multi-line font-lock regexps Stefan Monnier
@ 2008-05-07  3:46       ` Chong Yidong
  2008-05-07  4:21         ` Stefan Monnier
  0 siblings, 1 reply; 30+ messages in thread
From: Chong Yidong @ 2008-05-07  3:46 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

> While reading the "regexp does not work as documented" thread, an idea
> came to me: we could have an idle task that takes the font-lock regexps
> and instead of applying them directly, only uses them to find matches
> that span multiple lines and mark them with the
> `font-lock-multiline' property.
>
> The idea is that the font-lock-multiline property works OK to preserve multiline
> matches, so the real difficulty is in making sure we discover
> them correctly.  Sometimes we do by happenstance, sometimes we do
> because the major-mode was careful to make it work (which is far from
> trivial), but often we just don't.  Having such a background loop would
> be very helpful.  Its job can easily be stopped at any time, so it
> shouldn't introduce long latencies.

Sounds like a good idea, but wouldn't that run into the same problem
with the JIT lock stealth timer that necessitated setting
jit-lock-stealth-time to nil (i.e., people on laptops complaining about
Emacs eating CPU)?





* Re: Auto-discovery of multi-line font-lock regexps
  2008-05-07  3:46       ` Chong Yidong
@ 2008-05-07  4:21         ` Stefan Monnier
  0 siblings, 0 replies; 30+ messages in thread
From: Stefan Monnier @ 2008-05-07  4:21 UTC (permalink / raw)
  To: Chong Yidong; +Cc: emacs-devel

> Sounds like a good idea, but wouldn't that run into the same problem
> with the JIT lock stealth timer that necessitated setting
> jit-lock-stealth-time to nil (i.e., people on laptops complaining about
> Emacs eating CPU)?

Of course.  Except it would actually make a difference w.r.t. behavior
rather than just performance.  I expect it would only be enabled in some
particular buffers where it proves necessary.  Maybe the problematic
regexps could be specially labelled in font-lock-keywords so that only
relevant regexps get this kind of treatment.

        Stefan


* Re: regexp does not work as documented
  2008-05-06 12:12   ` martin rudalics
@ 2008-05-10 19:18     ` David Koppelman
  2008-05-10 20:13       ` David Koppelman
  0 siblings, 1 reply; 30+ messages in thread
From: David Koppelman @ 2008-05-10 19:18 UTC (permalink / raw)
  To: martin rudalics; +Cc: Chong Yidong, 192, Bruno Haible, emacs-devel

I was able to reproduce the problem with Bruno Haible's testcase and
font-lock-multiline t does "fix" it. However martin rudalics warns that
font-lock-multiline won't work for all cases and provides an example
idea (below). I can't get that to fail. That is, with
font-lock-multiline t the text was correctly fontified (though after a
pause). My realization of the example was to remove and then add the
first quotation mark from one of the interior lines below (also
tried with more quoted lines):

msgstr ""
"Project-Id-Version: GNU gettext-tools 0.16.2-pre5\n"
"Report-Msgid-Bugs-To: bug-gnu-gettext@gnu.org\n"
"POT-Creation-Date: 2007-11-02 03:23+0100\n"
"PO-Revision-Date: 2007-10-27 13:35+0200\n"
"Last-Translator: Christophe Combelles <ccomb@free.fr>\n"
"Last-Translator: Christophe Combelles <ccomb@free.fr>\n"
x

The fix I'm contemplating would be to warn the user when a multi-line
regexp was added interactively and font-lock-multiline was nil, and then
perhaps to offer to set font-lock-multiline to t (or to not set it, or
to stop asking).




martin rudalics <rudalics@gmx.at> writes:

>> Can someone help me find a workaround, then? If not, I would have to give up
>> maintaining po-mode as part of GNU gettext. Said function is central in
>> Emacs po-mode (everything else relies on it), and if multi-line regular
>> expressions don't work, I don't know how this function could be rewritten.
>
> Don't worry, Stefan will find the solution.  First of all you will
> probably have to
>
> (setq font-lock-multiline t)
>
> in the respective buffer.  This will _not_ always DTRT after a buffer
> modification, as, for example, in
>
> AAAA
>
> CCCC
>
> BBBB
>
> where AAAA stands for some old text previously matched by your regexp,
> CCCC for some new text inserted (or old text removed), and BBBB for some
> text which, after the change, is now matched by the regexp (or not
> matched any more): In this case BBBB will be wrongly highlighted now.
> Alan uses the notorious
>
> `font-lock-extend-jit-lock-region-after-change'
>
> function to handle this, but it's not immediately clear how to apply
> this here.  If everything else fails you will have to refontify till
> `window-end' (I prefer using a timer for such refontifications).





* Re: regexp does not work as documented
  2008-05-06 15:35   ` Stefan Monnier
  2008-05-06 21:29     ` Bruno Haible
@ 2008-05-10 20:04     ` Bruno Haible
  1 sibling, 0 replies; 30+ messages in thread
From: Bruno Haible @ 2008-05-10 20:04 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Chong Yidong, 192, koppel, emacs-devel

Stefan Monnier wrote:
> >   - I originally observed the bug in po-mode (part of GNU gettext), in
> >     a function po-find-span-of-entry which essentially only calls
> >     re-search-backward and re-search-forward.
> 
> Please try and reproduce the problem there and send us a recipe.

After deeper investigation, the bug in po-mode was not directly related:
I had used a regexp which was not designed for use with re-search-backward.
This bug is closed now. <https://savannah.gnu.org/bugs/?23177>

Thank you Stefan for bringing me on the right track.

Bruno






* Re: regexp does not work as documented
  2008-05-10 19:18     ` David Koppelman
@ 2008-05-10 20:13       ` David Koppelman
  2008-05-11  7:40         ` martin rudalics
  0 siblings, 1 reply; 30+ messages in thread
From: David Koppelman @ 2008-05-10 20:13 UTC (permalink / raw)
  To: martin rudalics; +Cc: Chong Yidong, 192, Bruno Haible, emacs-devel

I *am* able to reproduce the font-lock-multiline limitation martin
described if the buffer is in text mode. I had tried to reproduce it
in emacs-lisp mode.

First I'll work on the hi-lock warning as I described below, then I'll
see about detecting and doing something helpful for additional
situations where multi-line won't work.

David Koppelman <koppel@ece.lsu.edu> writes:

> I was able to reproduce the problem with Bruno Haible's testcase and
> font-lock-multiline t does "fix" it. However martin rudalics warns that
> font-lock-multiline won't work for all cases and provides an example
> idea (below). I can't get that to fail. That is, with
> font-lock-multiline t the text was correctly fontified (though after a
> pause). My realization of the example was to remove and then add the
> first quotation mark from one of the interior lines below (also
> tried with more quoted lines):
>
> msgstr ""
> "Project-Id-Version: GNU gettext-tools 0.16.2-pre5\n"
> "Report-Msgid-Bugs-To: bug-gnu-gettext@gnu.org\n"
> "POT-Creation-Date: 2007-11-02 03:23+0100\n"
> "PO-Revision-Date: 2007-10-27 13:35+0200\n"
> "Last-Translator: Christophe Combelles <ccomb@free.fr>\n"
> "Last-Translator: Christophe Combelles <ccomb@free.fr>\n"
> x
>
> The fix I'm contemplating would be to warn the user when a multi-line
> regexp was added interactively and font-lock-multiline was nil, and then
> perhaps to offer to set font-lock-multiline to t (or to not set it, or
> to stop asking).





* Re: regexp does not work as documented
  2008-05-10 20:13       ` David Koppelman
@ 2008-05-11  7:40         ` martin rudalics
  2008-05-11 14:27           ` Chong Yidong
  0 siblings, 1 reply; 30+ messages in thread
From: martin rudalics @ 2008-05-11  7:40 UTC (permalink / raw)
  To: David Koppelman; +Cc: Chong Yidong, 192, Bruno Haible, emacs-devel

 > First I'll work on the hi-lock warning as I described below, then I'll
 > see about detecting and doing something helpful for additional
 > situations where multi-line won't work.

Think of the following pathological case: Devise a regexp to highlight
the first line of a buffer provided the buffer does not end with a
newline.  Doing this with `font-lock-multiline' hardly makes any sense.

Maybe users should classify whether a regexp they use

(1) doesn't match newlines - no `font-lock-multiline' needed,

(2) matches at most n newlines in which case you should tell font-lock to
rescan from n lines before each buffer change (with large n the display
engine will suffer noticeably, mainly because font-lock has to search
for all other keywords as well), or

(3) may match more than n newlines in which case you should use an idle
timer to scan the entire buffer for any matches of such regexps and
highlight them separately.






* Re: regexp does not work as documented
  2008-05-11  7:40         ` martin rudalics
@ 2008-05-11 14:27           ` Chong Yidong
  2008-05-11 15:36             ` David Koppelman
  2008-05-11 18:44             ` Stefan Monnier
  0 siblings, 2 replies; 30+ messages in thread
From: Chong Yidong @ 2008-05-11 14:27 UTC (permalink / raw)
  To: martin rudalics; +Cc: David Koppelman, 192, Bruno Haible, emacs-devel

martin rudalics <rudalics@gmx.at> writes:

>> First I'll work on the hi-lock warning as I described below, then I'll
>> see about detecting and doing something helpful for additional
>> situations where multi-line won't work.
>
> Think of the following pathological case: Devise a regexp to highlight
> the first line of a buffer provided the buffer does not end with a
> newline.  Doing this with `font-lock-multiline' hardly makes any sense.

Ideally, highlight-regexp should work automagically, instead of forcing
users to do something extra to make their multi-line regexp work
properly.  The right way to do this is probably for hi-lock-mode to
process the buffer initially, setting up text properties to make
font-lock DTRT even for multi-line expressions.  But that's a big job.

As for making hi-lock-mode detect whether or not a regexp is multi-line,
isn't that a computationally non-trivial problem?

Maybe making hi-lock-mode turn on font-lock-multiline, while not
foolproof, works often enough to be satisfactory.


* Re: regexp does not work as documented
  2008-05-11 14:27           ` Chong Yidong
@ 2008-05-11 15:36             ` David Koppelman
  2008-05-11 18:44               ` Stefan Monnier
  2008-05-11 18:44             ` Stefan Monnier
  1 sibling, 1 reply; 30+ messages in thread
From: David Koppelman @ 2008-05-11 15:36 UTC (permalink / raw)
  To: Chong Yidong; +Cc: martin rudalics, 192, Bruno Haible, emacs-devel

I agree pretty much with everything Chong Yidong writes.

I'd rather not bother the user with an additional question if I don't
have to; the alternative would be a warning.

My latest plan is to do what Chong Yidong suggests, setting up text
properties so that font-lock DTRT, though it doesn't seem as hard as he
suggests (I'm still in the naive enthusiasm stage). I tried adding the
font-lock-multiline property to the face property list passed to font
lock and that did the trick, even with the font-lock-multiline variable
nil. I'd rather do that than turn on font-lock-multiline because I'm
assuming that font-lock-multiline is set to nil in most cases for a good
reason.

Rather than perfectly distinguishing multi-line from single-line
patterns, guessing would be good enough for hi-lock. I'm using the
following regexp,
"\\(\n.\\|\\\\W[*+]\\|\\\\[SC].[*+]\\|\\[\\^[^]]+\\][+*]\\)", which
hopefully isn't too far from covering a large majority of interactively
entered patterns.
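
As a sketch of how such a guess could be wrapped up, assuming the regexp
above is used as-is; `my-multiline-guess-regexp' and
`my-regexp-maybe-multiline-p' are hypothetical names, not hi-lock's
actual code.  case-fold-search is bound to nil so that \W and \S in the
guess do not also match \w and \s:

```elisp
;; The heuristic regexp quoted above, stored under a hypothetical name.
(defconst my-multiline-guess-regexp
  "\\(\n.\\|\\\\W[*+]\\|\\\\[SC].[*+]\\|\\[\\^[^]]+\\][+*]\\)"
  "Regexp matching constructs that suggest a pattern may span lines.")

(defun my-regexp-maybe-multiline-p (regexp)
  "Return non-nil if REGEXP looks like it could match across lines.
This is only a guess based on `my-multiline-guess-regexp'; it can
report false positives and false negatives."
  (let ((case-fold-search nil))
    (and (string-match-p my-multiline-guess-regexp regexp) t)))

;; Usage:
;; (my-regexp-maybe-multiline-p "^m.*\n\\(\".*\n\\)+")
;; (my-regexp-maybe-multiline-p "foo.*bar")
```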

I actually thought about properly parsing the regexp, but the effort to
do that could be spent on making multi-line patterns work properly, at
least if they don't span too many lines.

One more thing: multi-line regexp matches don't work properly even with
font-lock-multiline t when jit-lock is being used in a buffer without
syntactic fontification and with the default setting of
jit-lock-contextually; setting it to t gets multi-line fontification to
work.

I plan to play around a bit more and come up with something,
maybe today, maybe early this week.

Chong Yidong <cyd@stupidchicken.com> writes:

> Ideally, highlight-regexp should work automagically, instead of forcing
> users to do something extra to make their multi-line regexp work
> properly.  The right way to do this is probably for hi-lock-mode to
> process the buffer initially, setting up text properties to make
> font-lock DTRT even for multi-line expressions.  But that's a big job.
>
> As for making hi-lock-mode detect whether or not a regexp is multi-line,
> isn't that a computationally non-trivial problem?
>
> Maybe making hi-lock-mode turn on font-lock-multiline, while not
> foolproof, works often enough to be satisfactory.


* Re: regexp does not work as documented
  2008-05-11 15:36             ` David Koppelman
@ 2008-05-11 18:44               ` Stefan Monnier
  2008-05-11 19:09                 ` David Koppelman
  0 siblings, 1 reply; 30+ messages in thread
From: Stefan Monnier @ 2008-05-11 18:44 UTC (permalink / raw)
  To: David Koppelman
  Cc: martin rudalics, Chong Yidong, 192, Bruno Haible, emacs-devel

> My latest plan is to do what Chong Yidong suggests, setting up text
> properties so that font-lock DTRT, though it doesn't seem as hard as he
> suggests (I'm still in the naive enthusiasm stage).

Indeed, it shouldn't be that hard.

> I tried adding the font-lock-multiline property to the face property
> list passed to font lock and that did the trick, even with the
> font-lock-multiline variable nil.

That may not be enough.  You'll probably want to do something like what
smerge does:

  (while (re-search-forward <RE> nil t)
    (font-lock-fontify-region (match-beginning 0) (match-end 0)))

this will find all the multiline elements.  And the font-lock-multiline
property you add will make sure that those that were found will not
disappear accidentally because of some later refontification.
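
Put together, the two pieces might look like the following sketch.  The
names `my-multiline-keyword' and `my-fontify-all-matches' are
hypothetical; the (face FACE PROP VAL ...) form of the face spec is what
sets the font-lock-multiline text property, and the save-excursion is
a precaution in case fontification moves point:

```elisp
;; Hypothetical helper: build a font-lock keyword whose face spec also
;; puts the `font-lock-multiline' text property on the whole match.
(defun my-multiline-keyword (regexp face)
  "Return a font-lock keyword highlighting REGEXP with FACE.
The match is additionally marked with the `font-lock-multiline' property."
  `(,regexp (0 '(face ,face font-lock-multiline t) t)))

;; Hypothetical helper: explicitly fontify each match so that multi-line
;; elements are found regardless of how font-lock chunks the buffer.
(defun my-fontify-all-matches (regexp)
  "Call `font-lock-fontify-region' on every match of REGEXP in the buffer."
  (save-excursion
    (goto-char (point-min))
    (while (and (not (eobp))
                (re-search-forward regexp nil t))
      (let ((beg (match-beginning 0))
            (end (match-end 0)))
        (if (= beg end)
            ;; Empty match: step forward so the loop cannot get stuck.
            (unless (eobp) (forward-char 1))
          (save-excursion
            (font-lock-fontify-region beg end)))))))

;; Usage:
;; (let ((re "^m.*\n\\(\".*\n\\)+"))
;;   (font-lock-add-keywords nil (list (my-multiline-keyword re 'highlight)))
;;   (my-fontify-all-matches re))
```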

> I'd rather do that than turn on font-lock-multiline because I'm assuming
> that font-lock-multiline is set to nil in most cases for
> a good reason.

Setting the `font-lock-multiline' variable to t has a performance cost.

> I actually thought about properly parsing the regexp, but the effort to
> do that could be spent on making multi-line patterns work properly, at
> least if they don't span too many lines.

If someone wants that, I have a parser that takes a regexp and turns it
into something like `rx' syntax.  It uses my lex.el library (which
takes an `rx'-like input syntax).

> One more thing, multi-line regexp matches don't work properly even
> with font-lock-multiline t when jit-lock is being used in a buffer
> without syntactic fontification and using the default setting of
> jit-lock-contextually; setting it to t gets multi-line fontification
> to work.

The `font-lock-multiline' variable only tells font-lock that if it ever
bumps into a multiline element, it should mark it (with the
font-lock-multiline property) so that it will re-fontify it as
a whole if it ever needs to refontify it.

So it doesn't solve the problem of "how do I make sure that font-lock
indeed finds the multiline element".  Multiline elements can only be
found when font-locking a large enough piece of text, which tends to
only happen during the initial fontification, or during background or
contextual refontification, or during an explicit call such as in the
above while loop.

        Stefan


* Re: regexp does not work as documented
  2008-05-11 14:27           ` Chong Yidong
  2008-05-11 15:36             ` David Koppelman
@ 2008-05-11 18:44             ` Stefan Monnier
  2008-05-11 20:03               ` Thomas Lord
  1 sibling, 1 reply; 30+ messages in thread
From: Stefan Monnier @ 2008-05-11 18:44 UTC (permalink / raw)
  To: Chong Yidong
  Cc: martin rudalics, David Koppelman, 192, Bruno Haible, emacs-devel

> As for making hi-lock-mode detect whether or not a regexp is multi-line,
> isn't that a computationally non-trivial problem?

Well, you can turn the regexp into a DFA, then take the ".*\n.+" regexp,
turn it into another DFA, take the intersection of the two DFAs, and if
it's empty you know your regexp can never match a multiline element.


        Stefan





* Re: regexp does not work as documented
  2008-05-11 18:44               ` Stefan Monnier
@ 2008-05-11 19:09                 ` David Koppelman
  2008-05-12  1:28                   ` Stefan Monnier
  0 siblings, 1 reply; 30+ messages in thread
From: David Koppelman @ 2008-05-11 19:09 UTC (permalink / raw)
  To: Stefan Monnier
  Cc: martin rudalics, Chong Yidong, 192, Bruno Haible, emacs-devel

I've decided against having hi-lock turn on font-lock-multiline or
even apply font-lock-multiline text properties: there is too much
potential to slow things down to a crawl when an unsuspecting user
enters a regexp.

If I understand things correctly, the font-lock-multiline property is
used to extend a region to be fontified, a region to be used for *all*
keywords. This would have disastrous effects when multi-line patterns
span, say, hundreds of lines in modes with hundreds of keywords. I had
been toying with the idea of limiting extended regions to something
like 100 lines, but that still seems wasteful when most keywords are
single line (I haven't benchmarked anything yet).

A better solution would be to have font-lock use multi-line extended
regions selectively. Perhaps a hint in the current keyword syntax
(say, explicitly applying the font-lock-multiline property), or a
separate method for providing multi-line keywords to font-lock.
Such keywords would get the multi-line extended regions, the others
just the whole-line extensions (or whatever the hooks do).

Is this something the font-lock maintainers would consider?

What I'll do now is just document the limitations for hi-lock and
perhaps provide a warning when a multiline pattern is used.

> That may not be enough.  You'll probably want to do something like what
> smerge does:
>
>   (while (re-search-forward <RE> nil t)
>     (font-lock-fontify-region (match-beginning 0) (match-end 0)))

I wouldn't do that without suppressing other keywords. 

> If someone wants that, I have a parser that takes a regexp and turns it
> into something like `rx' syntax.  It uses my lex.el library (which
> takes an `rx'-like input syntax).

That sounds useful; either e-mail it to me or let me know
where to find it.

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> My latest plan is to do what Chong Yidong suggests, setting up text
>> properties so that font-lock DTRT, though it doesn't seem as hard as he
>> suggests (I'm still in the naive enthusiasm stage).
>
> Indeed, it shouldn't be that hard.
>
>> I tried adding the font-lock-multiline property to the face property
>> list passed to font lock and that did the trick, even with the
>> font-lock-multiline variable nil.
>
> That may not be enough.  You'll probably want to do something like what
> smerge does:
>
>   (while (re-search-forward <RE> nil t)
>     (font-lock-fontify-region (match-beginning 0) (match-end 0)))
>
> this will find all the multiline elements.  And the font-lock-multiline
> property you add will make sure that those that were found will not
> disappear accidentally because of some later refontification.
>
>> I rather do that than turn on font-lock-multiline because I'm assuming
>> that font-lock-multiline is set to nil in most cases for
>> a good reason.
>
> Setting the `font-lock-multiline' variable to t has a performance cost.
>
>> I actually thought about properly parsing the regexp, but the effort to
>> do that could be spent on making multi-line patterns work properly, at
>> least if they don't span too many lines.
>
> If someone wants that, I have a parser that takes a regexp and turns it
> into something like `rx' syntax.  It uses my lex.el library (which
> takes an `rx'-like input syntax).
>
>> One more thing, multi-line regexp matches don't work properly even
>> with font-lock-multiline t when jit-lock is being used in a buffer
>> without syntactic fontification and using the default setting of
>> jit-lock-contextually, setting it to t gets multi-line fontification
>> to work.
>
> The `font-lock-multiline' variable only tells font-lock that if it ever
> bumps into a multiline element, it should mark it (with the
> font-lock-multiline property) so that it will re-fontify it as
> a whole if it ever needs to refontify it.
>
> So it doesn't solve the problem of "how do I make sure that font-lock
> indeed finds the multiline element".  Multiline elements can only be
> found when font-locking a large enough piece of text, which tends to
> only happen during the initial fontification, or during background or
> contextual refontification, or during an explicit call such as in the
> above while loop.
>
>
>         Stefan


* Re: regexp does not work as documented
  2008-05-11 18:44             ` Stefan Monnier
@ 2008-05-11 20:03               ` Thomas Lord
  2008-05-12  1:43                 ` Stefan Monnier
  0 siblings, 1 reply; 30+ messages in thread
From: Thomas Lord @ 2008-05-11 20:03 UTC (permalink / raw)
  To: Stefan Monnier
  Cc: Chong Yidong, 192, emacs-devel, martin rudalics, David Koppelman,
	Bruno Haible


>> As for making hi-lock-mode detect whether or not a regexp is multi-line,
>> isn't that a computationally non-trivial problem?
>>     
>
> Well, you can turn the regexp into a DFA, then take the ".*\n.+" regexp,
> turn it into another DFA, take the intersection of the two DFAs, and if
> it's empty you know your regexp can never match a multiline element.
>   

If you are going to go to that trouble, perhaps there is a better
solution:

The Rx pattern matcher found in distributions of GNU Arch has these
relevant capabilities (relevant at least so far as I understand the
problem you are trying to solve):

1. It does on-the-fly regexp->DFA conversion, degrading gracefully into
    mixed NFA/DFA mode or pure NFA mode if the DFA would be too
    large.   The calling program gets to say what "too large" is.

2. Although it is a C library, you can capture what is (in essence) the
    continuation of an on-going match.   That is, you can suspend a
    match (or scan) part-way through, then later resume from that point,
    perhaps multiple times.   (This does not involve abusing the C stack.)

3. It does have some Unicode support in there and, though these
    capabilities are under-tested and some features are missing, it is
    quite flexible about encoding forms.

4. The DFA construction is "caching" and, for a given regexp, all uses
    will share the DFA construction.   E.g., multiple, suspended regexp
    continuations can be space efficient because they will share state.

5. Because of the caching and structure sharing, you can tell if two
    continuations from a single regexp have arrived at the same state
    with a C EQ test ("==").

How can this help?

Well, instead of using heuristics to decide where to re-scan from and
to, you can cache a record of where the DFA scan arrived at for periodic
positions in the buffer.   Then begin scanning from just before any
modification for as far as it takes to arrive at a DFA state that is
the same as last time, updating any highlighting in the region between
those two points.
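
A minimal sketch of that rescan in Emacs Lisp, not Rx itself.  It
assumes a toy two-state DFA (`code' vs. `comment', where `;' starts a
comment and a newline ends it), a cached state for every character
rather than for periodic positions, and a same-length edit so that old
and new positions line up.  All `my-' names are hypothetical.

```elisp
(defun my-comment-step (state chr)
  "Toy DFA transition: `;' enters `comment', newline returns to `code'."
  (cond ((eq chr ?\n) 'code)
        ((eq chr ?\;) 'comment)
        (t state)))

(defun my-scan-states (step start string)
  "Return a vector whose Ith element is the state after reading STRING[I]."
  (let ((states (make-vector (length string) nil))
        (state start))
    (dotimes (i (length string))
      (setq state (funcall step state (aref string i)))
      (aset states i state))
    states))

(defun my-rescan (step start string cache from)
  "Rescan STRING from index FROM, updating CACHE in place.
Stop as soon as the new state equals the cached one.  Return the index
just past the last character rescanned, i.e. the end of the region
whose highlighting may need updating."
  (let ((state (if (> from 0) (aref cache (1- from)) start))
        (i from)
        (done nil))
    (while (and (not done) (< i (length string)))
      (setq state (funcall step state (aref string i)))
      (if (eq state (aref cache i))
          (setq done t)          ; same state as last time: stop here
        (aset cache i state))
      (setq i (1+ i)))
    i))
```

For example, after scanning "ab\ncd\nef" and then replacing the `c' with
`;', rescanning from index 3 stops at the next newline: only the edited
line is revisited, because that is where the new state agrees with the
cached one again.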

I don't mean to imply that this is a trivial thing to implement in
Emacs but if you start getting up to building DFAs (very expensive in
the worst case) and taking intersections (very expensive in the worst
case) -- both also not all that simple to implement (nor obviously
possible for Emacs' extended regexp language) -- then the effort may
be comparable and (re-)visiting the option to adapt Rx to Emacs should
be worth considering.

As a point of amusement and credit where due, I think it was Jim Blandy
who first noticed this possibility in the early 1990s when I was
explaining to him the capabilities I was then just beginning to add to
Rx.

This is a very old problem, long recognized, with some work already done on
a (purportedly) Right Thing solution.

-t



* Re: regexp does not work as documented
  2008-05-11 19:09                 ` David Koppelman
@ 2008-05-12  1:28                   ` Stefan Monnier
  2008-05-12 15:03                     ` David Koppelman
  0 siblings, 1 reply; 30+ messages in thread
From: Stefan Monnier @ 2008-05-12  1:28 UTC (permalink / raw)
  To: David Koppelman
  Cc: martin rudalics, Chong Yidong, 192, Bruno Haible, emacs-devel

[-- Attachment #1: Type: text/plain, Size: 1670 bytes --]

> A better solution would be to have font-lock use multi-line extended
> regions selectively. Perhaps a hint in the current keyword syntax
> (say, explicitly applying the font-lock-multiline property), or a
> separate method for providing multi-line keywords to font-lock.

I don't understand the difference between the above and the application
of font-lock-multiline properties which you seem to have tried and rejected.

I don't necessarily disagree with your rejection of font-lock-multiline:
it can have disastrous effects indeed if the multiline region becomes large.

> Such keywords would get the multi-line extended regions, the other
> just the whole-line extensions (or whatever the hooks do).
> Is this something the font-lock maintainers would consider?

I guess I simply do not understand what you propose.  Any improvement in
the multiline handling is welcome, but beware: this is not an easy area.

>> (while (re-search-forward <RE> nil t)
>>   (font-lock-fontify-region (match-beginning 0) (match-end 0)))
> I wouldn't do that without suppressing other keywords.

FWIW, I do pretty much exactly the above loop in smerge-mode and
I haven't heard complaints yet.
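
For reference, a minimal sketch of that loop wrapped in a function.
This is not the smerge-mode code; the function name, the match counter
and the empty-match guard are assumptions added for illustration.

```elisp
(defun my-fontify-all-matches (regexp)
  "Call `font-lock-fontify-region' on every match of REGEXP in the buffer.
Return the number of matches processed."
  (let ((count 0))
    (save-excursion
      (goto-char (point-min))
      (while (and (not (eobp)) (re-search-forward regexp nil t))
        (let ((beg (match-beginning 0))
              (end (match-end 0)))
          ;; Fontify exactly the span of the match, so that a multiline
          ;; element is seen by font-lock as a whole.
          (save-excursion
            (font-lock-fontify-region beg end))
          (setq count (1+ count))
          ;; Guard against regexps that can match the empty string.
          (when (and (= beg end) (not (eobp)))
            (forward-char 1)))))
    count))
```

Each call still runs every keyword over the matched span, which is the
cost David is worried about; the sketch does nothing to suppress the
other keywords.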

>> If someone wants that, I have a parser that takes a regexp and turns it
>> into something like `rx' syntax.  It uses my lex.el library (which
>> takes an `rx'-like input syntax).

> That sounds useful, either E-mail it to me or let me know
> where to find it.

Find the current version attached.  Consider it as 99.9% untested
code, tho.  Also you need to eval it before you can byte-compile it.
And I strongly recommend you byte-compile it to reduce the
specpdl usage.

        Stefan

[-- Attachment #2: lex.el --]
[-- Type: application/emacs-lisp, Size: 49609 bytes --]


* Re: regexp does not work as documented
  2008-05-11 20:03               ` Thomas Lord
@ 2008-05-12  1:43                 ` Stefan Monnier
  2008-05-12  3:30                   ` Thomas Lord
  0 siblings, 1 reply; 30+ messages in thread
From: Stefan Monnier @ 2008-05-12  1:43 UTC (permalink / raw)
  To: Thomas Lord
  Cc: Chong Yidong, 192, emacs-devel, martin rudalics, David Koppelman,
	Bruno Haible

> Well, instead of using heuristics to decide where to re-scan from and
> to, you can cache a record of where the DFA scan arrived at for
> periodic positions in the buffer.   Then begin scanning from just
> before any modification for as far as it takes to arrive at a DFA
> state that is the same as last time, updating any highlighting in the
> region between those two points.

That's a very good point.  I'm not sure it's worth the trouble to store
it at various buffer positions and check if it's EQ to stop the rescan,
but at least we could match multiline expressions one line at a time.

In any case, it's indeed a non-trivial amount of work because it
probably requires rewriting not just font-lock but all the
foo-mode-font-lock-keywords as well (font-lock-keywords are
order-dependent, so you can't apply rule no. 3 after rule no. 4).

> I don't mean to imply that this is a trivial thing to implement in
> Emacs but if you start getting up to building DFAs (very expensive in
> the worst case) and taking intersections (very expensive in the worst
> case) -- both also not all that simple to implement (nor obviously
> possible for Emacs' extended regexp language) -- then the effort may
> be comparable and (re-)visiting the option to adapt Rx to Emacs should
> be worth considering.

I have most of the DFA construction code written, but I may take you up
on that anyway.  BTW, regarding the "very expensive in the worst case",
how common is this worst case in real life?

        Stefan


* Re: regexp does not work as documented
  2008-05-12  1:43                 ` Stefan Monnier
@ 2008-05-12  3:30                   ` Thomas Lord
  2008-05-12 13:43                     ` Stefan Monnier
  0 siblings, 1 reply; 30+ messages in thread
From: Thomas Lord @ 2008-05-12  3:30 UTC (permalink / raw)
  To: Stefan Monnier
  Cc: Chong Yidong, 192, emacs-devel, martin rudalics, David Koppelman,
	Bruno Haible

Stefan Monnier wrote:
> I have most of the DFA construction code written, but I may take you up
> on that anyway.  BTW, regarding the "very expensive in the worst case",
> how common is this worst case in real life?
>   

Talking about general purpose use of an engine, as opposed to something
more narrow like "likely font-lock expressions":

It's common enough in "real life" (in my experience) to be exactly the
minimal amount required to make it annoying.

Let me give you some rules of thumb to describe what I mean but I'll
refrain from a long treatise explaining the theory that supports these.
If you have questions I can "unpack" this on list or off.

First, be careful to make a clear distinction between (let's dub the
distinction) "regexps vs. regular expressions".   "Regular expressions"
correspond to formally regular languages.  Regular expressions can
always be converted to a DFA.   "regexps" are what most people use
in most situations.   regexps include things like features to extract the
positions that match a sub-pattern.   regexps do *not* correspond to
regular languages and *can not* always be converted to DFA.   DFA
conversion can help with regexp matching, but it can't solve it completely.
Moreover, you can make pathological regexps (very slow to match) for
every regexp engine I know of.   Henry Spencer (I've heard) asserts that
regexps are NP-complete or something around there, though I haven't seen the
proof.  (Rx is a hybrid regular expression and regexp engine based on an
on-line (caching, incremental) DFA conversion suggested in the "Dragon"
compiler book and originating from (as I recall) an experiment by
Thompson and one of the other Bell Labs guys.  One of whoever it was
privately mentioned that they abandoned it themselves because it was taking
too much code to implement, or something like that.)

That aside, regular expressions are probably plenty for font-lock
functionality, so let's just talk about those.   On-line DFA conversion
for those is likely practical by multiple means.

The size of a DFA is, worst-case, exponential in the size of the regular
expression.  This is the source of pathological cases that can thwart
any DFA engine.  (In exponential cases, Rx keeps running but more like
an NFA, taking a corresponding performance hit.)

For a time, I was pushing hard on Rx, trying to use it for absolutely
everything.  I would make up new problems to solve, as in "Ok, let's
suppose I can run X-Kbyte long regular expressions with lots of |
operators, stars, etc.   What can I apply that to?"  One example is
lexing for real-world computer languages.

[Aside: you can also use the same engine as the core of a shift-reduce
parser, of course -- and I did that a long time ago with Guile, to
useful effect.]

Almost always, the expressions were such that Rx would have no problem
and give very pleasing results.   However, there were definitely times
(for huge regular expressions and smaller ones) when things would just
absolutely crawl.   They happen "just often enough" to be an annoyance.

Now, every case of that annoyance that I found (in real-life
applications) had a solution.  I could think for a while on *why* the
regular expression I wrote was blowing up and then think of a different
approach that eliminated the problem.   Most often that meant not a
different but equivalent regular expression -- it meant going one level
up and changing the way the caller used regular expressions.   The
caller would retain the same functionality but different demands would
be made on the regular expression (or regexp) engine.

And that makes it tricky to drop *blindly* into Emacs.   To solve the
unavoidable annoying cases you really had to know how regular
expressions worked at a deep level.   To diagnose and work around the
pathological cases took some expertise.

As a general rule, for something "general purpose" like a libc
implementation or the default regexp functions in Emacs lisp, there is
something to be said for using NFA matchers.  It's perverse why:

Vast swaths of things that a DFA matcher can do very quickly an NFA
matcher can not.  It's reliably slow.  Therefore, people not prepared to
think hard about regexps tend to use an NFA matcher in pretty limited
ways.   A powerful regular expression engine could simplify their code,
at the cost of risking finding a "hard case" -- but there's no issue
since people quickly give up and don't try to push the match engine
that hard.

More concisely:

If you use a DFA matcher You Better Know What You're Doing but also
Most of the Time You Won't Need To So It's Very Convenient.

If you use an NFA matcher You Can Get Away With Not Really Knowing What
You're Doing but also You Won't Be Trying Anything Fancy, Perhaps Losing
Out.

One approach to toss into the mix, this one suggested by Per Bothner
years ago, is to consider *offline* DFA conversion (a la 'lex(1)').
The advantage of offline (batch) conversion is that you can burn a lot
of cycles on DFA minimization and, if your offline converter
terminates, you've got a reliably linear matcher.   The disadvantages
for *many* uses of regular expressions in Emacs should be pretty
obvious.  For something like font-lock, where the regular expressions
don't change that often, that might be a good approach -- precompile
a minimal DFA and then add support for "regular expression
continuations" when using those tables.

-t


* Re: regexp does not work as documented
  2008-05-12  3:30                   ` Thomas Lord
@ 2008-05-12 13:43                     ` Stefan Monnier
  2008-05-12 15:55                       ` Thomas Lord
  0 siblings, 1 reply; 30+ messages in thread
From: Stefan Monnier @ 2008-05-12 13:43 UTC (permalink / raw)
  To: Thomas Lord
  Cc: Chong Yidong, 192, emacs-devel, martin rudalics, David Koppelman,
	Bruno Haible

> years ago, is to consider *offline* DFA conversion (a la 'lex(1)').

That's what I do in lex.el.

> The advantage of offline (batch) conversion is that you can burn a lot
> of cycles on DFA minimization and, if your offline converter
> terminates, you've got a reliably linear matcher.  The disadvantages
> for *many* uses of regular expressions in Emacs should be pretty
> obvious.  For something like font-lock, where the regular expressions
> don't change that often, that might be a good approach -- precompile
> a minimal DFA and then add support for "regular expression
> continuations" when using those tables.

I do not intend to replace src/regex.c with a matcher based on offline
DFA conversion.  Actually, the need to support backrefs makes it pretty
much impossible (tho I'm sure there's a way to adapt an offline DFA so
it can be used with backrefs), and most importantly its performance
characteristics are too different.  More specifically, the compilation
step should be made explicit.

In any case I think you did answer my question: an offline DFA matcher
is fine, the worst case is not that common and can be worked around.
This is not that different from the current backtracking matcher.

        Stefan

PS: The original motivation for a DFA-matcher is to extend syntax-tables
so they can match multi-char elements.


* Re: regexp does not work as documented
  2008-05-12  1:28                   ` Stefan Monnier
@ 2008-05-12 15:03                     ` David Koppelman
  2008-05-12 16:29                       ` Stefan Monnier
  0 siblings, 1 reply; 30+ messages in thread
From: David Koppelman @ 2008-05-12 15:03 UTC (permalink / raw)
  To: Stefan Monnier
  Cc: martin rudalics, Chong Yidong, 192, Bruno Haible, emacs-devel

> I guess I simply do not understand what you propose.  Any improvement in
> the multiline handling is welcome, but beware: this is not an easy area.

I'm proposing that font-lock divide keywords into two or three
classes: ordinary, multi-line, and maybe mega-line.  Matches for
multi-line and mega-line keywords would be over much larger
regions. Here is how it might work with two classes (keep in mind that
I don't yet have a thorough understanding of font-lock and jit-lock):

  Multi-line keywords are explicitly identified as such, perhaps
  through keyword syntax or the way they are given to font-lock (say,
  using font-lock-multiline-keywords). Explicit identification avoids
  performance problems from keywords that, though technically
  multi-line, rarely span more than a few lines.

  Functions such as font-lock-default-fontify-region would find two
  sets of extended regions, ordinary and multi, running functions on
  two hooks for this purpose. The multi-line hook might extend the
  region based on the size of the largest supported match rather than
  using the multiline property. The multiline property might still be
  useful for non-deferred handling of existing matches.

  Functions such as font-lock-fontify-keywords-region would be passed
  both extended regions and use the region appropriate for each
  keyword they process. The large region is only used on the few
  multi-line patterns that need it.

Here I'm assuming that a mode might have hundreds of single-line (or
two-line) keywords and only a few multi-line keywords, and the
multi-line keywords might span no more than hundreds of lines. We
could guarantee that matches for such patterns are perfect (using a
line-count-limit variable).

If there were a third class, mega-line, it would have its own text
property and region-extension hook.

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> A better solution would be to have font-lock use multi-line extended
>> regions selectively. Perhaps a hint in the current keyword syntax
>> (say, explicitly applying the font-lock-multiline property), or a
>> separate method for providing multi-line keywords to font-lock.
>
> I don't understand the difference between the above and the application
> of font-lock-multiline properties which you seem to have tried and rejected.
>
> I don't necessarily disagree with your rejection of font-lock-multiline:
> it can have disastrous effect indeed if the multiline region becomes large.
>
>> Such keywords would get the multi-line extended regions, the other
>> just the whole-line extensions (or whatever the hooks do).
>> Is this something the font-lock maintainers would consider?
>
> I guess I simply do not understand what you propose.  Any improvement in
> the multiline handling is welcome, but beware: this is not an easy area.
>


* Re: regexp does not work as documented
  2008-05-12 13:43                     ` Stefan Monnier
@ 2008-05-12 15:55                       ` Thomas Lord
  2008-05-12 16:18                         ` tomas
  0 siblings, 1 reply; 30+ messages in thread
From: Thomas Lord @ 2008-05-12 15:55 UTC (permalink / raw)
  To: Stefan Monnier
  Cc: Chong Yidong, 192, emacs-devel, martin rudalics, David Koppelman,
	Bruno Haible

Stefan Monnier wrote:
> That's what I do in lex.el.
>
>   

Sounds nice.  

Last bits of experience report, then:

If it isn't so already, it may be easy to make it so
that a choice of which DFA is being used, plus a choice of the
"current state" can be represented as lisp objects and cheaply
copied.  That gives the essence of "regular expression continuations".

Handy features that shouldn't be difficult to add (if not present):

Let programmers specify "labels" for each NFA state and then,
for each DFA state, have either a list of all NFA labels that
correspond to that DFA state and/or a more general way to
"combine" NFA state labels to make the DFA label.  You can
wind up with many NFA states combined to a single DFA state,
of course, so a "combine" function might be important.

Include scanning functions to:

~ advance the DFA at most N characters (or until failure)
~ advance the DFA to the next non-nil state label (or failure)

In both cases, give a way for lisp programs to get back not only
the label (or failure indication) but also the regular expression
continuation.

Those features are handy so that (for example) lisp programs can
hang a suspended regexp continuation on a buffer character as
a property, doing incremental "re-lexing" in application-specific
ways.

The "advance to non-nil label" feature is useful for writing lisp
programs that *do not* need back-referencing or sub-exp locations
per se.

It is a bit more speculative but also consider functions to:

~ advance the state of a DFA based on characters provided
   in a function call rather than read from a buffer -- e.g., a
   buffer position should not have to be part of the state of a
   running DFA.  
       (advance-dfa re-continuation chr) => re-continuation

Why that last one?  Because then you can probably use the same
DFA engine as the heart of a shift-reduce parser and (for languages
that admit such things) write an incremental parser.  (You'd be using
non-buffer-position DFAs to process token ids emitted by the lexer.)
You can also use such a feature for things like serial I/O protocols.
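
A minimal sketch of such a continuation in Emacs Lisp, assuming a
hand-built DFA rather than anything lex.el or Rx produces.  Every `my-'
name is hypothetical; only the (advance-dfa re-continuation chr) shape
comes from the suggestion above.

```elisp
;; A continuation is a cons (DFA . STATE).  It holds no buffer position,
;; so it can be copied, stored as a text property, and resumed freely.

(defconst my-abc-dfa
  '(:start q0 :accept (q2)
    :trans (((q0 . ?a) . q1) ((q1 . ?b) . q1) ((q1 . ?c) . q2)))
  "Hand-built DFA for the regular expression \"ab*c\".")

(defun my-dfa-continuation (dfa)
  "Return a fresh continuation for DFA, positioned at its start state."
  (cons dfa (plist-get dfa :start)))

(defun my-advance-dfa (cont chr)
  "Advance continuation CONT by character CHR.
Return a new continuation, or nil if the DFA has no such transition.
CONT itself is not modified."
  (let* ((dfa (car cont))
         (next (cdr (assoc (cons (cdr cont) chr)
                           (plist-get dfa :trans)))))
    (and next (cons dfa next))))

(defun my-advance-dfa-string (cont string)
  "Feed the characters of STRING to CONT one at a time."
  (let ((i 0))
    (while (and cont (< i (length string)))
      (setq cont (my-advance-dfa cont (aref string i))
            i (1+ i)))
    cont))

(defun my-dfa-accepting-p (cont)
  "Return t if continuation CONT is in an accepting state."
  (and cont (memq (cdr cont) (plist-get (car cont) :accept)) t))
```

Because `my-advance-dfa' returns a new cons instead of mutating its
argument, a continuation suspended after reading "ab" can be resumed
several times with different inputs.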

Incremental parsers open the door to robust "syntax directed editing"
which I think could be an exciting direction for IDE features to take.
(Years ago, Thomas Reps and Tim Teitelbaum worked on the "Synthesizer
Generator" which I recall had features along these lines (their parser
guts were probably different from what I suggest).  As I (now vaguely)
recall there is a book that talks about their Emacs-based implementation.)

Bye.  Thanks.  And good luck!
-t


* Re: regexp does not work as documented
  2008-05-12 15:55                       ` Thomas Lord
@ 2008-05-12 16:18                         ` tomas
  0 siblings, 0 replies; 30+ messages in thread
From: tomas @ 2008-05-12 16:18 UTC (permalink / raw)
  To: Thomas Lord
  Cc: Chong Yidong, 192, emacs-devel, martin rudalics, David Koppelman,
	Stefan Monnier, Bruno Haible

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Mon, May 12, 2008 at 08:55:14AM -0700, Thomas Lord wrote:

[...]

> Why that last one?  Because then you can probably use the same
> DFA engine as the heart of a shift-reduce parser and (for languages
> that admit such things) write an incremental parser.  (You'd be using
> non-buffer-position DFAs to process token ids emitted by the lexer.)
> You can also use such a feature for things like serial I/O protocols.

...and cool things could be done with process-filter-function and its
cousin after-insert-file-functions (i.e. parse input on-the-fly). Nifty
stuff.

Regards
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)

iD8DBQFIKG3jBcgs9XrR2kYRAtv8AJ9uj1wEjjT4bIPNQxYoYY5iPJW8cwCdE87U
vsVarzdJhCu143kN7OGWh/Q=
=0dcf
-----END PGP SIGNATURE-----





* Re: regexp does not work as documented
  2008-05-12 15:03                     ` David Koppelman
@ 2008-05-12 16:29                       ` Stefan Monnier
  2008-05-12 17:04                         ` David Koppelman
  0 siblings, 1 reply; 30+ messages in thread
From: Stefan Monnier @ 2008-05-12 16:29 UTC (permalink / raw)
  To: David Koppelman
  Cc: martin rudalics, Chong Yidong, 192, Bruno Haible, emacs-devel

> I'm proposing that font-lock divide keywords into two or three
> classes, ordinary, multi-line, and maybe mega-line, matches for
> multi-line and mega-line keywords would be over much larger
> regions. Here is how it might work with two classes (keep in mind that
> I don't yet have a thorough understanding of font-lock and jit-lock):

I do not understand how you propose to solve the main problem:
Let's say you want to fontify a line spanning chars 100..200 and
a multiline region spanning 0..400.  Before fontifying, you need
to unfontify.  The region 100..200 can be completely unfontified, but
what about 0..99 and 201..400?  You can't unfontify them completely
since you don't want to refontify them completely either, so you'd need
to figure out which part of the fontification comes from the
multiline keywords.

Also, the order between keywords is important, so unless you force all
multiline keywords to go at the very end, you'd also need to remove (on
the 0..99 and 201..400 regions) the fontification coming from small
keywords that were placed after multiline keywords and reapply
it afterwards?

        Stefan


* Re: regexp does not work as documented
  2008-05-12 16:29                       ` Stefan Monnier
@ 2008-05-12 17:04                         ` David Koppelman
  0 siblings, 0 replies; 30+ messages in thread
From: David Koppelman @ 2008-05-12 17:04 UTC (permalink / raw)
  To: Stefan Monnier
  Cc: martin rudalics, Chong Yidong, 192, Bruno Haible, emacs-devel

> a multiline region spanning 0..400.  Before fontifying, you need
> to unfontify.  The region 100..200 can be completely unfontified, but

Hadn't thought about that. I don't want things to get too elaborate but
it would be nice to have guaranteed behavior below some multi-line size
and not risk slow behavior.

One possibility is to retain the code as it is, except have
extend-region-multiline extend to some maximum size (say, 100 lines)
with the expectation that the larger region would be used for deferred
fontification (I guess jit-lock does that). The only difference from
current operation is that the font-lock-multiline property is ignored,
both ensuring proper matches (when the property is not present but a
pattern would match) and avoiding huge regions.

Now, if we wanted really large multi-line matches we could unfontify the
larger region but use a window+margin sized region (accounting for all
buffers visiting the file) for the regular patterns and then mark the
other parts of the larger region as unfontified. This would force
re-applying the multi-line patterns on buffer motion, though
we could cache the match data to avoid re-searching.

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> I'm proposing that font-lock divide keywords into two or three
>> classes, ordinary, multi-line, and maybe mega-line, matches for
>> multi-line and mega-line keywords would be over much larger
>> regions. Here is how it might work with two classes (keep in mind that
>> I don't yet have a thorough understanding of font-lock and jit-lock):
>
> I do not understand how you propose to solve the main problem:
> Let's say you want to fontify a line spanning chars 100..200 and
> a multiline region spanning 0..400.  Before fontifying, you need
> to unfontify.  The region 100..200 can be completely unfontified, but
> what about 0..99 and 201..400?  You can't unfontify them completely
> since you don't want to refontify them completely either, so you'd need
> to figure out which part of the fontification comes from the
> multiline keywords.
>
> Also, the order between keywords is important, so unless you force all
> multiline keywords to go at the very end, you'd also need to remove (on
> the 0..99 and 201..400 regions) the fontification coming from small
> keywords that were placed after multiline keywords and reapply
> it afterwards?
>
>
>         Stefan


Thread overview: 30+ messages
2008-05-06  4:20 regexp does not work as documented Chong Yidong
2008-05-06 11:35 ` Bruno Haible
2008-05-06 12:12   ` martin rudalics
2008-05-10 19:18     ` David Koppelman
2008-05-10 20:13       ` David Koppelman
2008-05-11  7:40         ` martin rudalics
2008-05-11 14:27           ` Chong Yidong
2008-05-11 15:36             ` David Koppelman
2008-05-11 18:44               ` Stefan Monnier
2008-05-11 19:09                 ` David Koppelman
2008-05-12  1:28                   ` Stefan Monnier
2008-05-12 15:03                     ` David Koppelman
2008-05-12 16:29                       ` Stefan Monnier
2008-05-12 17:04                         ` David Koppelman
2008-05-11 18:44             ` Stefan Monnier
2008-05-11 20:03               ` Thomas Lord
2008-05-12  1:43                 ` Stefan Monnier
2008-05-12  3:30                   ` Thomas Lord
2008-05-12 13:43                     ` Stefan Monnier
2008-05-12 15:55                       ` Thomas Lord
2008-05-12 16:18                         ` tomas
2008-05-06 15:35   ` Stefan Monnier
2008-05-06 21:29     ` Bruno Haible
2008-05-10 20:04     ` Bruno Haible
2008-05-06 15:00 ` David Koppelman
2008-05-06 21:35   ` Bruno Haible
2008-05-07  1:04     ` Stefan Monnier
2008-05-07  1:08     ` Auto-discovery of multi-line font-lock regexps Stefan Monnier
2008-05-07  3:46       ` Chong Yidong
2008-05-07  4:21         ` Stefan Monnier
