* Re: regexp does not work as documented
@ 2008-05-06 4:20 Chong Yidong
2008-05-06 11:35 ` Bruno Haible
2008-05-06 15:00 ` David Koppelman
0 siblings, 2 replies; 30+ messages in thread
From: Chong Yidong @ 2008-05-06 4:20 UTC (permalink / raw)
To: emacs-devel; +Cc: koppel, 192, Bruno Haible
> $ emacs fr.po
> M-x highlight-regexp
>
> Regexp to highlight (enter this with Ctrl-Q Ctrl-J for each of the two
> newlines):
>
> ^msgstr\(\[[0-9]\]\)?.*
> \(".*
> \)*
>
> Highlight using face: hi-yellow
>
> Then move around in the buffer and look which lines are highlighted.
> In the first match already, only 5 out of 11 lines are highlighted.
I believe this bug arises because highlight-regexp uses font-lock to
highlight the regular expression, and the font-lock engine is
intentionally limiting the region to search for the multi-line regular
expression.
OTOH, I don't see what we can do about this problem. Maybe we could add
a note to the docstring of highlight-regexp saying that multi-line
regular expressions are problematic? Does anyone have a suggestion?
BTW, here is a simplified recipe, for those who didn't download the
attached file:
1. Copy the following text, between the "---...----" lines, into a
buffer
------------------
# Messages français pour GNU gettext.
# Copyright © 2006 Free Software Foundation, Inc.
# François Pinard <pinard@iro.umontreal.ca>, 1996.
#
#
msgid ""
msgstr ""
"Project-Id-Version: GNU gettext-tools 0.16.2-pre5\n"
"Report-Msgid-Bugs-To: bug-gnu-gettext@gnu.org\n"
"POT-Creation-Date: 2007-11-02 03:23+0100\n"
"PO-Revision-Date: 2007-10-27 13:35+0200\n"
"Last-Translator: Christophe Combelles <ccomb@free.fr>\n"
"Language-Team: French <traduc@traduc.org>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=2; plural=(n > 1);\n"
------------------
2. M-: (highlight-regexp "^m.*\n\\(\".*\n\\)+") RET
Note that the last two lines remain unhighlighted.
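As a rough cross-check of the recipe above, the same regexp can be run with a plain `re-search-forward` in a temporary buffer, outside of font-lock. This is only a sketch: `my-regexp-match-span` and `my-sample-po-text` are hypothetical names, and the sample text is a shortened stand-in for the PO header. If the raw search spans every quoted line, that would suggest the truncation comes from how the highlighting is applied rather than from the regexp engine.

```elisp
;; Hypothetical helper for illustration; not part of Emacs or hi-lock.
(defun my-regexp-match-span (text regexp)
  "Insert TEXT in a temp buffer and search for REGEXP from the start.
Return (BEG . END) of the first match, or nil if there is none."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (when (re-search-forward regexp nil t)
      (cons (match-beginning 0) (match-end 0)))))

;; Shortened sample in the style of the recipe (an assumption).
(defvar my-sample-po-text
  (concat "msgid \"\"\n"
          "msgstr \"\"\n"
          "\"MIME-Version: 1.0\\n\"\n"
          "\"Content-Transfer-Encoding: 8bit\\n\"\n"))

;; The regexp from step 2 of the recipe.
(my-regexp-match-span my-sample-po-text "^m.*\n\\(\".*\n\\)+")
```

The returned cons can be compared against the buffer size to see whether the match reaches the last quoted line.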
* Re: regexp does not work as documented
2008-05-06 4:20 regexp does not work as documented Chong Yidong
@ 2008-05-06 11:35 ` Bruno Haible
2008-05-06 12:12 ` martin rudalics
2008-05-06 15:35 ` Stefan Monnier
2008-05-06 15:00 ` David Koppelman
1 sibling, 2 replies; 30+ messages in thread
From: Bruno Haible @ 2008-05-06 11:35 UTC (permalink / raw)
To: Chong Yidong, emacs-devel; +Cc: koppel, 192

Chong Yidong wrote:
> BTW, here is a simplified recipe, for those who didn't download the
> attached file:
>
> 1. Copy the following text, between the "---...----" lines, into a
> buffer
>
> ------------------
> # Messages français pour GNU gettext.
> # Copyright © 2006 Free Software Foundation, Inc.
> # François Pinard <pinard@iro.umontreal.ca>, 1996.
> #
> #
> msgid ""
> msgstr ""
> "Project-Id-Version: GNU gettext-tools 0.16.2-pre5\n"
> "Report-Msgid-Bugs-To: bug-gnu-gettext@gnu.org\n"
> "POT-Creation-Date: 2007-11-02 03:23+0100\n"
> "PO-Revision-Date: 2007-10-27 13:35+0200\n"
> "Last-Translator: Christophe Combelles <ccomb@free.fr>\n"
> "Language-Team: French <traduc@traduc.org>\n"
> "MIME-Version: 1.0\n"
> "Content-Type: text/plain; charset=UTF-8\n"
> "Content-Transfer-Encoding: 8bit\n"
> "Plural-Forms: nplurals=2; plural=(n > 1);\n"
> ------------------
>
> 2. M-: (highlight-regexp "^m.*\n\\(\".*\n\\)+") RET
>
> Note that the last two lines remain unhighlighted.

Yes. I reproduce with this simpler recipe as well. Thank you.

> I believe this bug arises because highlight-regexp uses font-lock to
> highlight the regular expression, and the font-lock engine is
> intentionally limiting the region to search for the multi-line regular
> expression.

You are right that there is a limit, but it is set to 200000:
highlight-regexp is aliased to hi-lock-face-buffer, which asks for the
arguments and calls hi-lock-set-pattern. hi-lock-set-pattern does little
more than applying a margin of 100000 and calling re-search-forward.

I believe the origin of the bug is deeper, because
- the limit of 100000 is way larger than the little snippet you posted,
- I originally observed the bug in po-mode (part of GNU gettext), in a
  function po-find-span-of-entry which essentially only calls
  re-search-backward and re-search-forward.

> OTOH, I don't see what we can do about this problem. Maybe we could add
> a note to the docstring of highlight-regexp saying that multi-line
> regular expressions are problematic?

Can someone help me find a workaround, then? If not, I would have to give up
maintaining po-mode as part of GNU gettext. Said function is central in
Emacs po-mode (everything else relies on it), and if multi-line regular
expressions don't work, I don't know how this function could be rewritten.

Bruno
* Re: regexp does not work as documented
2008-05-06 11:35 ` Bruno Haible
@ 2008-05-06 12:12 ` martin rudalics
2008-05-10 19:18 ` David Koppelman
2008-05-06 15:35 ` Stefan Monnier
1 sibling, 1 reply; 30+ messages in thread
From: martin rudalics @ 2008-05-06 12:12 UTC (permalink / raw)
To: Bruno Haible; +Cc: Chong Yidong, 192, koppel, emacs-devel

> Can someone help me find a workaround, then? If not, I would have to give up
> maintaining po-mode as part of GNU gettext. Said function is central in
> Emacs po-mode (everything else relies on it), and if multi-line regular
> expressions don't work, I don't know how this function could be rewritten.

Don't worry, Stefan will find the solution. First of all you will
probably have to

  (setq font-lock-multiline t)

in the respective buffer. This will _not_ always DTRT after a buffer
modification, as, for example, in

  AAAA

  CCCC

  BBBB

where AAAA stands for some old text previously matched by your regexp,
CCCC for some new text inserted (or old text removed), and BBBB for some
text which, after the change, is now matched by the regexp (or not
matched any more): In this case BBBB will be wrongly highlighted now.
Alan uses the notorious

  `font-lock-extend-jit-lock-region-after-change'

function to handle this, but it's not immediately clear how to apply
this here. If everything else fails you will have to refontify till
`window-end' (I prefer using a timer for such refontifications).
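The "in the respective buffer" part of the suggestion above could be sketched as follows. This is a minimal illustration, not code from the thread: `my-enable-font-lock-multiline` is a hypothetical helper name, and `setq-local` and `font-lock-flush` are assumed to be available (they come from Emacs releases later than this thread, so an older Emacs would need `make-local-variable` and a different way to request refontification).

```elisp
;; Hypothetical helper; assumes an Emacs that provides `setq-local'.
(defun my-enable-font-lock-multiline ()
  "Turn on `font-lock-multiline' in the current buffer only."
  ;; Buffer-local, so other buffers keep the cheaper default.
  (setq-local font-lock-multiline t)
  ;; Ask font-lock to redo existing highlighting when that is possible.
  (when (fboundp 'font-lock-flush)
    (font-lock-flush)))

;; Usage, from the buffer that holds the PO file:
;; (my-enable-font-lock-multiline)
```

As the message notes, this does not cover every buffer modification; it only lets font-lock treat multiline matches it has already found as a unit.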
* Re: regexp does not work as documented
2008-05-06 12:12 ` martin rudalics
@ 2008-05-10 19:18 ` David Koppelman
2008-05-10 20:13 ` David Koppelman
0 siblings, 1 reply; 30+ messages in thread
From: David Koppelman @ 2008-05-10 19:18 UTC (permalink / raw)
To: martin rudalics; +Cc: Chong Yidong, 192, Bruno Haible, emacs-devel

I was able to reproduce the problem with Bruno Haible's testcase and
font-lock-multiline t does "fix" it. However martin rudalics warns that
font-lock-multiline won't work for all cases and provides an example
idea (below). I can't get that to fail. That is, with
font-lock-multiline t the text was correctly fontified (though after a
pause). My realization of the example was to remove and then add the
first quotation mark from one of the interior lines below (also
tried with more quoted lines):

msgstr ""
"Project-Id-Version: GNU gettext-tools 0.16.2-pre5\n"
"Report-Msgid-Bugs-To: bug-gnu-gettext@gnu.org\n"
"POT-Creation-Date: 2007-11-02 03:23+0100\n"
"PO-Revision-Date: 2007-10-27 13:35+0200\n"
"Last-Translator: Christophe Combelles <ccomb@free.fr>\n"
"Last-Translator: Christophe Combelles <ccomb@free.fr>\n"
x

The fix I'm contemplating would be to warn the user when a multi-line
regexp was added interactively and font-lock-multiline was nil, and then
perhaps to offer to set font-lock-multiline to t (or to not set it, or
to stop asking).

martin rudalics <rudalics@gmx.at> writes:

>> Can someone help me find a workaround, then? If not, I would have to give up
>> maintaining po-mode as part of GNU gettext. Said function is central in
>> Emacs po-mode (everything else relies on it), and if multi-line regular
>> expressions don't work, I don't know how this function could be rewritten.
>
> Don't worry, Stefan will find the solution. First of all you will
> probably have to
>
> (setq font-lock-multiline t)
>
> in the respective buffer. This will _not_ always DTRT after a buffer
> modification, as, for example, in
>
> AAAA
>
> CCCC
>
> BBBB
>
> where AAAA stands for some old text previously matched by your regexp,
> CCCC for some new text inserted (or old text removed), and BBBB for some
> text which, after the change, is now matched by the regexp (or not
> matched any more): In this case BBBB will be wrongly highlighted now.
> Alan uses the notorious
>
> `font-lock-extend-jit-lock-region-after-change'
>
> function to handle this, but it's not immediately clear how to apply
> this here. If everything else fails you will have to refontify till
> `window-end' (I prefer using a timer for such refontifications).
* Re: regexp does not work as documented
2008-05-10 19:18 ` David Koppelman
@ 2008-05-10 20:13 ` David Koppelman
2008-05-11 7:40 ` martin rudalics
0 siblings, 1 reply; 30+ messages in thread
From: David Koppelman @ 2008-05-10 20:13 UTC (permalink / raw)
To: martin rudalics; +Cc: Chong Yidong, 192, Bruno Haible, emacs-devel

I *am* able to reproduce the font-lock-multiline limitation martin
described if the buffer is in text mode. I had tried to reproduce it
in emacs-lisp mode.

First I'll work on the hi-lock warning as I described below, then I'll
see about detecting and doing something helpful for additional
situations where multi-line won't work.

David Koppelman <koppel@ece.lsu.edu> writes:

> I was able to reproduce the problem with Bruno Haible's testcase and
> font-lock-multiline t does "fix" it. However martin rudalics warns that
> font-lock-multiline won't work for all cases and provides an example
> idea (below). I can't get that to fail. That is, with
> font-lock-multiline t the text was correctly fontified (though after a
> pause). My realization of the example was to remove and then add the
> first quotation mark from one of the interior lines below (also
> tried with more quoted lines):
>
> msgstr ""
> "Project-Id-Version: GNU gettext-tools 0.16.2-pre5\n"
> "Report-Msgid-Bugs-To: bug-gnu-gettext@gnu.org\n"
> "POT-Creation-Date: 2007-11-02 03:23+0100\n"
> "PO-Revision-Date: 2007-10-27 13:35+0200\n"
> "Last-Translator: Christophe Combelles <ccomb@free.fr>\n"
> "Last-Translator: Christophe Combelles <ccomb@free.fr>\n"
> x
>
> The fix I'm contemplating would be to warn the user when a multi-line
> regexp was added interactively and font-lock-multiline was nil, and then
> perhaps to offer to set font-lock-multiline to t (or to not set it, or
> to stop asking).
* Re: regexp does not work as documented
2008-05-10 20:13 ` David Koppelman
@ 2008-05-11 7:40 ` martin rudalics
2008-05-11 14:27 ` Chong Yidong
0 siblings, 1 reply; 30+ messages in thread
From: martin rudalics @ 2008-05-11 7:40 UTC (permalink / raw)
To: David Koppelman; +Cc: Chong Yidong, 192, Bruno Haible, emacs-devel

> First I'll work on the hi-lock warning as I described below, then I'll
> see about detecting and doing something helpful for additional
> situations where multi-line won't work.

Think of the following pathological case: Devise a regexp to highlight
the first line of a buffer provided the buffer does not end with a
newline. Doing this with `font-lock-multiline' hardly makes any sense.

Maybe users should classify whether a regexp they use

(1) doesn't match newlines - no `font-lock-multiline' needed,

(2) matches at most n newlines, in which case you should tell font-lock
    to rescan from n lines before each buffer change (with large n the
    display engine will suffer noticeably, mainly because font-lock has
    to search for all other keywords as well), or

(3) may match more than n newlines, in which case you should use an idle
    timer to scan the entire buffer for any matches of such regexps and
    highlight them separately.
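Case (2) above, rescanning from n lines around each buffer change, might be sketched with `font-lock-extend-after-change-region-function`. This is an assumption about one way to wire it up, not what hi-lock does; `my-multiline-max-lines` and `my-extend-after-change-region` are hypothetical names, and the sketch widens the region in both directions so that a match ending after the change is also covered.

```elisp
;; Hypothetical bound n on the number of newlines a regexp may match.
(defvar my-multiline-max-lines 10)

(defun my-extend-after-change-region (beg end _old-len)
  "Return (BEG' . END') widened by `my-multiline-max-lines' lines.
Takes the three arguments font-lock passes for a change and returns
the region that should be refontified."
  (save-excursion
    (let ((new-beg (progn (goto-char beg)
                          (forward-line (- my-multiline-max-lines))
                          (point)))
          (new-end (progn (goto-char end)
                          (forward-line my-multiline-max-lines)
                          (point))))
      (cons new-beg new-end))))

;; Buffer-local installation, run in the buffer that needs it:
;; (set (make-local-variable 'font-lock-extend-after-change-region-function)
;;      #'my-extend-after-change-region)
```

The cost mentioned in the message still applies: every keyword is re-searched over the widened region, so a large n makes each edit more expensive.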
* Re: regexp does not work as documented
2008-05-11 7:40 ` martin rudalics
@ 2008-05-11 14:27 ` Chong Yidong
2008-05-11 15:36 ` David Koppelman
2008-05-11 18:44 ` Stefan Monnier
0 siblings, 2 replies; 30+ messages in thread
From: Chong Yidong @ 2008-05-11 14:27 UTC (permalink / raw)
To: martin rudalics; +Cc: David Koppelman, 192, Bruno Haible, emacs-devel

martin rudalics <rudalics@gmx.at> writes:

>> First I'll work on the hi-lock warning as I described below, then I'll
>> see about detecting and doing something helpful for additional
>> situations where multi-line won't work.
>
> Think of the following pathological case: Devise a regexp to highlight
> the first line of a buffer provided the buffer does not end with a
> newline. Doing this with `font-lock-multiline' hardly makes any sense.

Ideally, highlight-regexp should work automagically, instead of forcing
users to do something extra to make their multi-line regexp work
properly. The right way to do this is probably for hi-lock-mode to
process the buffer initially, setting up text properties to make
font-lock DTRT even for multi-line expressions. But that's a big job.

As for making hi-lock-mode detect whether or not a regexp is multi-line,
isn't that a computationally non-trivial problem?

Maybe making hi-lock-mode turn on font-lock-multiline, while not
foolproof, works often enough to be satisfactory.
* Re: regexp does not work as documented
2008-05-11 14:27 ` Chong Yidong
@ 2008-05-11 15:36 ` David Koppelman
2008-05-11 18:44 ` Stefan Monnier
2008-05-11 18:44 ` Stefan Monnier
1 sibling, 1 reply; 30+ messages in thread
From: David Koppelman @ 2008-05-11 15:36 UTC (permalink / raw)
To: Chong Yidong; +Cc: martin rudalics, 192, Bruno Haible, emacs-devel

I agree pretty much with everything Chong Yidong writes. I rather not
bother the user with an additional question if I don't have to, the
alternative would be a warning.

My latest plan is to do what Chong Yidong suggests, setting up text
properties so that font-lock DTRT, though it doesn't seem as hard as he
suggests (I'm still in the naive enthusiasm stage).

I tried adding the font-lock-multiline property to the face property
list passed to font lock and that did the trick, even with the
font-lock-multiline variable nil. I rather do that than turn on
font-lock-multiline because I'm assuming that font-lock-multiline is
set to nil in most cases for a good reason.

Rather than perfectly distinguishing multi-line from single line
patterns guessing would be good enough for hi-lock. I'm using the
following regexp,

  "\\(\n.\\|\\\\W[*+]\\|\\\\[SC].[*+]\\|\\[\\^[^]]+\\][+*]\\)",

which hopefully isn't too far from covering a large majority of
interactively entered patterns.

I actually thought about properly parsing the regexp, but the effort to
do that could be spent on making multi-line patterns work properly, at
least if they don't span too many lines.

One more thing, multi-line regexp matches don't work properly even
with font-lock-multiline t when jit-lock is being used in a buffer
without syntactic fontification and using the default setting of
jit-lock-contextually, setting it to t gets multi-line fontification
to work.

I plan to play around a bit more and come up with something, maybe
today, maybe early this week.

Chong Yidong <cyd@stupidchicken.com> writes:

> Ideally, highlight-regexp should work automagically, instead of forcing
> users to do something extra to make their multi-line regexp work
> properly. The right way to do this is probably for hi-lock-mode to
> process the buffer initially, setting up text properties to make
> font-lock DTRT even for multi-line expressions. But that's a big job.
>
> As for making hi-lock-mode detect whether or not a regexp is multi-line,
> isn't that a computationally non-trivial problem?
>
> Maybe making hi-lock-mode turn on font-lock-multiline, while not
> foolproof, works often enough to be satisfactory.
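The guessing approach described in this message can be sketched as a small predicate built around the quoted heuristic regexp. The wrapper names `my-multiline-guess-regexp` and `my-regexp-maybe-multiline-p` are hypothetical, and binding `case-fold-search` to nil is an added assumption so that `\W` and `\S` in the heuristic are not confused with `\w` and `\s` in the pattern being tested. Like the original, it is a guess and not a proof that a pattern can or cannot match a newline.

```elisp
;; The heuristic regexp quoted in the message; the constant name is hypothetical.
(defconst my-multiline-guess-regexp
  "\\(\n.\\|\\\\W[*+]\\|\\\\[SC].[*+]\\|\\[\\^[^]]+\\][+*]\\)")

(defun my-regexp-maybe-multiline-p (regexp)
  "Return t if REGEXP looks like it could match across a newline.
This is only a textual guess based on `my-multiline-guess-regexp'."
  ;; Assumption: match case-sensitively so that \\W is not taken for \\w.
  (let ((case-fold-search nil))
    (and (string-match-p my-multiline-guess-regexp regexp) t)))

;; Examples:
;; (my-regexp-maybe-multiline-p "^m.*\n\\(\".*\n\\)+")  ; literal newline
;; (my-regexp-maybe-multiline-p "foo.*bar")             ; `.' never matches newline
```

A caller such as an interactive highlighting command could use the predicate to decide whether to warn the user.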
* Re: regexp does not work as documented
2008-05-11 15:36 ` David Koppelman
@ 2008-05-11 18:44 ` Stefan Monnier
2008-05-11 19:09 ` David Koppelman
0 siblings, 1 reply; 30+ messages in thread
From: Stefan Monnier @ 2008-05-11 18:44 UTC (permalink / raw)
To: David Koppelman
Cc: martin rudalics, Chong Yidong, 192, Bruno Haible, emacs-devel

> My latest plan is to do what Chong Yidong suggests, setting up text
> properties so that font-lock DTRT, though it doesn't seem as hard as he
> suggests (I'm still in the naive enthusiasm stage).

Indeed, it shouldn't be that hard.

> I tried adding the font-lock-multiline property to the face property
> list passed to font lock and that did the trick, even with the
> font-lock-multiline variable nil.

That may not be enough. You'll probably want to do something like what
smerge does:

  (while (re-search-forward <RE> nil t)
    (font-lock-fontify-region (match-beginning 0) (match-end 0)))

this will find all the multiline elements. And the font-lock-multiline
property you add will make sure that those that were found will not
disappear accidentally because of some later refontification.

> I rather do that than turn on font-lock-multiline because I'm assuming
> that font-lock-multiline is set to nil in most cases for
> a good reason.

Setting the `font-lock-multiline' variable to t has a performance cost.

> I actually thought about properly parsing the regexp, but the effort to
> do that could be spent on making multi-line patterns work properly, at
> least if they don't span too many lines.

If someone wants that, I have a parser that takes a regexp and turns it
into something like `rx' syntax. It uses my lex.el library (which
takes an `rx'-like input syntax).

> One more thing, multi-line regexp matches don't work properly even
> with font-lock-multiline t when jit-lock is being used in a buffer
> without syntactic fontification and using the default setting of
> jit-lock-contextually, setting it to t gets multi-line fontification
> to work.

The `font-lock-multiline' variable only tells font-lock that if it ever
bumps into a multiline element, it should mark it (with the
font-lock-multiline property) so that it will re-fontify it as
a whole if it ever needs to refontify it.

So it doesn't solve the problem of "how do I make sure that font-lock
indeed finds the multiline element". Multiline elements can only be
found when font-locking a large enough piece of text, which tends to
only happen during the initial fontification, or during background or
contextual refontification, or during an explicit call such as in the
above while loop.


Stefan
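The explicit loop mentioned above can be wrapped into a function along the following lines. This is a sketch under assumptions, not the smerge code itself: `my-fontify-all-matches` is a hypothetical name, the match bounds are saved before calling font-lock because fontification may clobber match data and point, and a guard is added so that a regexp matching the empty string cannot loop forever.

```elisp
;; Hypothetical wrapper around the loop shown in the message.
(defun my-fontify-all-matches (regexp)
  "Fontify the full span of every REGEXP match in the current buffer.
Return the number of matches visited."
  (let ((count 0))
    (save-excursion
      (goto-char (point-min))
      (while (and (not (eobp))
                  (re-search-forward regexp nil t))
        ;; Save the bounds first: fontification may change match data.
        (let ((beg (match-beginning 0))
              (end (match-end 0)))
          (setq count (1+ count))
          (save-excursion
            (font-lock-fontify-region beg end))
          (goto-char end)
          ;; Step past an empty match so the loop always advances.
          (when (and (= beg end) (not (eobp)))
            (forward-char 1)))))
    count))
```

Each call to `font-lock-fontify-region` applies all keywords to the matched span, which is the cost raised later in the thread for modes with many keywords.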
* Re: regexp does not work as documented
2008-05-11 18:44 ` Stefan Monnier
@ 2008-05-11 19:09 ` David Koppelman
2008-05-12 1:28 ` Stefan Monnier
0 siblings, 1 reply; 30+ messages in thread
From: David Koppelman @ 2008-05-11 19:09 UTC (permalink / raw)
To: Stefan Monnier
Cc: martin rudalics, Chong Yidong, 192, Bruno Haible, emacs-devel

I've decided against having hi-lock turn on font-lock-multiline or
even apply font-lock-multiline text properties, too much potential to
slow things down to a crawl when an unsuspecting user enters a regexp.

If I understand things correctly, the font-lock-multiline property is
used to extend a region to be fontified, a region to be used for *all*
keywords. This would have disastrous effects when multi-line patterns
span, say, 100's of lines for modes with hundreds of keywords. I had
been toying with the idea of limiting extended regions to something
like 100 lines, but that still seems wasteful when most keywords are
single line (I haven't benchmarked anything yet).

A better solution would be to have font-lock use multi-line extended
regions selectively. Perhaps a hint in the current keyword syntax
(say, explicitly applying the font-lock-multiline property), or a
separate method for providing multi-line keywords to font-lock. Such
keywords would get the multi-line extended regions, the other just the
whole-line extensions (or whatever the hooks do).

Is this something the font-lock maintainers would consider?

What I'll do now is just document the limitations for hi-lock and
perhaps provide a warning when a multiline pattern is used.

> That may not be enough. You'll probably want to do something like what
> smerge does:
>
> (while (re-search-forward <RE> nil t)
> (font-lock-fontify-region (match-beginning 0) (match-end 0)))

I wouldn't do that without suppressing other keywords.

> If someone wants that, I have a parser that takes a regexp and turns it
> into something like `rx' syntax. It uses my lex.el library (which
> takes an `rx'-like input syntax).

That sounds useful, either E-mail it to me or let me know
where to find it.

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> My latest plan is to do what Chong Yidong suggests, setting up text
>> properties so that font-lock DTRT, though it doesn't seem as hard as he
>> suggests (I'm still in the naive enthusiasm stage).
>
> Indeed, it shouldn't be that hard.
>
>> I tried adding the font-lock-multiline property to the face property
>> list passed to font lock and that did the trick, even with the
>> font-lock-multiline variable nil.
>
> That may not be enough. You'll probably want to do something like what
> smerge does:
>
> (while (re-search-forward <RE> nil t)
> (font-lock-fontify-region (match-beginning 0) (match-end 0)))
>
> this will find all the multiline elements. And the font-lock-multiline
> property you add will make sure that those that were found will not
> disappear accidentally because of some later refontification.
>
>> I rather do that than turn on font-lock-multiline because I'm assuming
>> that font-lock-multiline is set to nil in most cases for
>> a good reason.
>
> Setting the `font-lock-multiline' variable to t has a performance cost.
>
>> I actually thought about properly parsing the regexp, but the effort to
>> do that could be spent on making multi-line patterns work properly, at
>> least if they don't span too many lines.
>
> If someone wants that, I have a parser that takes a regexp and turns it
> into something like `rx' syntax. It uses my lex.el library (which
> takes an `rx'-like input syntax).
>
>> One more thing, multi-line regexp matches don't work properly even
>> with font-lock-multiline t when jit-lock is being used in a buffer
>> without syntactic fontification and using the default setting of
>> jit-lock-contextually, setting it to t gets multi-line fontification
>> to work.
>
> The `font-lock-multiline' variable only tells font-lock that if it ever
> bumps into a multiline element, it should mark it (with the
> font-lock-multiline property) so that it will re-fontify it as
> a whole if it ever needs to refontify it.
>
> So it doesn't solve the problem of "how do I make sure that font-lock
> indeed finds the multiline element". Multiline elements can only be
> found when font-locking a large enough piece of text, which tends to
> only happen during the initial fontification, or during background or
> contextual refontification, or during an explicit call such as in the
> above while loop.
>
>
> Stefan
* Re: regexp does not work as documented
2008-05-11 19:09 ` David Koppelman
@ 2008-05-12 1:28 ` Stefan Monnier
2008-05-12 15:03 ` David Koppelman
0 siblings, 1 reply; 30+ messages in thread
From: Stefan Monnier @ 2008-05-12 1:28 UTC (permalink / raw)
To: David Koppelman
Cc: martin rudalics, Chong Yidong, 192, Bruno Haible, emacs-devel

[-- Attachment #1: Type: text/plain, Size: 1670 bytes --]

> A better solution would be to have font-lock use multi-line extended
> regions selectively. Perhaps a hint in the current keyword syntax
> (say, explicitly applying the font-lock-multiline property), or a
> separate method for providing multi-line keywords to font-lock.

I don't understand the difference between the above and the application
of font-lock-multiline properties which you seem to have tried and rejected.

I don't necessarily disagree with your rejection of font-lock-multiline:
it can have disastrous effect indeed if the multiline region becomes large.

> Such keywords would get the multi-line extended regions, the other
> just the whole-line extensions (or whatever the hooks do).
> Is this something the font-lock maintainers would consider?

I guess I simply do not understand what you propose. Any improvement in
the multiline handling is welcome, but beware: this is not an easy area.

>> (while (re-search-forward <RE> nil t)
>> (font-lock-fontify-region (match-beginning 0) (match-end 0)))

> I wouldn't do that without suppressing other keywords.

FWIW, I do pretty much exactly the above loop in smerge-mode and
I haven't heard complaints yet.

>> If someone wants that, I have a parser that takes a regexp and turns it
>> into something like `rx' syntax. It uses my lex.el library (which
>> takes an `rx'-like input syntax).

> That sounds useful, either E-mail it to me or let me know
> where to find it.

Find the current version attached. Consider it as 99.9% untested
code, tho. Also you need to eval it before you can byte-compile it.
And I strongly recommend you byte-compile it to reduce the specpdl usage.


Stefan

[-- Attachment #2: lex.el --]
[-- Type: application/emacs-lisp, Size: 49609 bytes --]
* Re: regexp does not work as documented
2008-05-12 1:28 ` Stefan Monnier
@ 2008-05-12 15:03 ` David Koppelman
2008-05-12 16:29 ` Stefan Monnier
0 siblings, 1 reply; 30+ messages in thread
From: David Koppelman @ 2008-05-12 15:03 UTC (permalink / raw)
To: Stefan Monnier
Cc: martin rudalics, Chong Yidong, 192, Bruno Haible, emacs-devel

> I guess I simply do not understand what you propose. Any improvement in
> the multiline handling is welcome, but beware: this is not an easy area.

I'm proposing that font-lock divide keywords into two or three
classes, ordinary, multi-line, and maybe mega-line, matches for
multi-line and mega-line keywords would be over much larger
regions. Here is how it might work with two classes (keep in mind that
I don't yet have a thorough understanding of font-lock and jit-lock):

Multi-line keywords are explicitly identified as such, perhaps through
keyword syntax or the way they are given to font-lock (say, using
font-lock-multiline-keywords). Explicit identification avoids
performance problems from keywords that, though technically
multi-line, rarely span more than a few lines.

Functions such as font-lock-default-fontify-region would find two sets
of extended regions, ordinary and multi, running functions on two
hooks for this purpose. The multi-line hook might extend the region
based on the size of the largest supported match rather than using the
multiline property. The multiline property might still be useful for
non-deferred handling of existing matches.

Functions such as font-lock-fontify-keywords-region would be passed
both extended regions and use the region appropriate for each keyword
they process.

The large region is only used on the few multi-line patterns that need
it. Here I'm assuming that a mode might have hundreds of single-line
(or two-line) keywords and only a few multi-line keywords, and the
multi-line keywords might span no more than hundreds of lines. We
could guarantee that matches for such patterns are perfect (using a
line-count-limit variable).

If there were a third class, mega-line, it would have its own text
property and region-extension hook.

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> A better solution would be to have font-lock use multi-line extended
>> regions selectively. Perhaps a hint in the current keyword syntax
>> (say, explicitly applying the font-lock-multiline property), or a
>> separate method for providing multi-line keywords to font-lock.
>
> I don't understand the difference between the above and the application
> of font-lock-multiline properties which you seem to have tried and rejected.
>
> I don't necessarily disagree with your rejection of font-lock-multiline:
> it can have disastrous effect indeed if the multiline region becomes large.
>
>> Such keywords would get the multi-line extended regions, the other
>> just the whole-line extensions (or whatever the hooks do).
>> Is this something the font-lock maintainers would consider?
>
> I guess I simply do not understand what you propose. Any improvement in
> the multiline handling is welcome, but beware: this is not an easy area.
>
* Re: regexp does not work as documented
2008-05-12 15:03 ` David Koppelman
@ 2008-05-12 16:29 ` Stefan Monnier
2008-05-12 17:04 ` David Koppelman
0 siblings, 1 reply; 30+ messages in thread
From: Stefan Monnier @ 2008-05-12 16:29 UTC (permalink / raw)
To: David Koppelman
Cc: martin rudalics, Chong Yidong, 192, Bruno Haible, emacs-devel

> I'm proposing that font-lock divide keywords into two or three
> classes, ordinary, multi-line, and maybe mega-line, matches for
> multi-line and mega-line keywords would be over much larger
> regions. Here is how it might work with two classes (keep in mind that
> I don't yet have a thorough understanding of font-lock and jit-lock):

I do not understand how you propose to solve the main problem:
Let's say you want to fontify a line spanning chars 100..200 and
a multiline region spanning 0..400. Before fontifying, you need
to unfontify. The region 100..200 can be completely unfontified, but
what about 0..99 and 201..400? You can't unfontify them completely
since you don't want to refontify them completely either, so you'd need
to figure out which part of the fontification comes from the
multiline keywords.

Also, the order between keywords is important, so unless you force all
multiline keywords to go at the very end, you'd also need to remove (on
the 0..99 and 201..400 regions) the fontification coming from small
keywords that were placed after multiline keywords and reapply
it afterwards?


Stefan
* Re: regexp does not work as documented
2008-05-12 16:29 ` Stefan Monnier
@ 2008-05-12 17:04 ` David Koppelman
0 siblings, 0 replies; 30+ messages in thread
From: David Koppelman @ 2008-05-12 17:04 UTC (permalink / raw)
To: Stefan Monnier
Cc: martin rudalics, Chong Yidong, 192, Bruno Haible, emacs-devel

> a multiline region spanning 0..400. Before fontifying, you need
> to unfontify. The region 100..200 can be completely unfontified, but

Hadn't thought about that. I don't want things to get too elaborate
but it would be nice to have guaranteed behavior below some multi-line
size and not risk slow behavior.

One possibility is to retain the code as it is, except have
extend-region-multiline extend to some maximum size (say, 100 lines)
with the expectation that the larger region would be used for deferred
fontification (I guess jit-lock does that). The only difference with
current operation is that the font-lock-multiline property is ignored
both ensuring proper matches (when the property is not present but a
pattern would match) and avoiding huge sized regions.

Now, if we wanted really large multi-line matches we could unfontify
the larger region but use a window+margin sized region (accounting for
all buffers visiting the file) for the regular patterns and then mark
the other parts of the larger region as unfontified. This would force
re-applying the multi-line patterns on buffer motion, though we could
cache the match data to avoid re-searching.

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> I'm proposing that font-lock divide keywords into two or three
>> classes, ordinary, multi-line, and maybe mega-line, matches for
>> multi-line and mega-line keywords would be over much larger
>> regions. Here is how it might work with two classes (keep in mind that
>> I don't yet have a thorough understanding of font-lock and jit-lock):
>
> I do not understand how you propose to solve the main problem:
> Let's say you want to fontify a line spanning chars 100..200 and
> a multiline region spanning 0..400. Before fontifying, you need
> to unfontify. The region 100..200 can be completely unfontified, but
> what about 0..99 and 201..400? You can't unfontify them completely
> since you don't want to refontify them completely either, so you'd need
> to figure out which part of the fontification comes from the
> multiline keywords.
>
> Also, the order between keywords is important, so unless you force all
> multiline keywords to go at the very end, you'd also need to remove (on
> the 0..99 and 201..400 regions) the fontification coming from small
> keywords that were placed after multiline keywords and reapply
> it afterwards?
>
>
> Stefan
* Re: regexp does not work as documented
2008-05-11 14:27 ` Chong Yidong
2008-05-11 15:36 ` David Koppelman
@ 2008-05-11 18:44 ` Stefan Monnier
2008-05-11 20:03 ` Thomas Lord
1 sibling, 1 reply; 30+ messages in thread
From: Stefan Monnier @ 2008-05-11 18:44 UTC (permalink / raw)
To: Chong Yidong
Cc: martin rudalics, David Koppelman, 192, Bruno Haible, emacs-devel
> As for making hi-lock-mode detect whether or not a regexp is multi-line,
> isn't that a computationally non-trivial problem?

Well, you can turn the regexp into a DFA, then take the ".*\n.+" regexp,
turn it into another DFA, take the intersection of the two DFAs, and if
it's empty you know your regexp can never match a multiline element.

Stefan
^ permalink raw reply [flat|nested] 30+ messages in thread
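A minimal sketch of the intersection test described in the message above, under strong simplifying assumptions: the DFAs are written out by hand instead of being compiled from regexps, and the alphabet is reduced to two symbols, `nl' for newline and `ch' for any other character. All `my-' names are hypothetical. The product automaton is explored breadth-first; the intersection is non-empty as soon as a pair of states is reached that is accepting in both DFAs.

```elisp
;; A DFA is a plist (:start STATE :accept LIST-OF-STATES :delta ALIST),
;; where ALIST maps (STATE . SYMBOL) to the next state.  A missing entry
;; means the DFA rejects.  States must be non-nil.

(defun my-dfa-step (dfa state sym)
  "Return the state DFA reaches from STATE on SYM, or nil if it rejects."
  (cdr (assoc (cons state sym) (plist-get dfa :delta))))

(defun my-dfa-intersection-empty-p (a b alphabet)
  "Return t if no string over ALPHABET is accepted by both A and B."
  (let* ((start (cons (plist-get a :start) (plist-get b :start)))
         (queue (list start))
         (seen (list start))
         (found nil))
    (while (and queue (not found))
      (let* ((pair (pop queue))
             (sa (car pair))
             (sb (cdr pair)))
        (if (and (member sa (plist-get a :accept))
                 (member sb (plist-get b :accept)))
            (setq found t)
          ;; Follow every symbol on which both DFAs can still proceed.
          (dolist (sym alphabet)
            (let ((na (my-dfa-step a sa sym))
                  (nb (my-dfa-step b sb sym)))
              (when (and na nb)
                (let ((next (cons na nb)))
                  (unless (member next seen)
                    (push next seen)
                    (setq queue (append queue (list next)))))))))))
    (not found)))

;; Hand-built DFA for ".*\n.+" (`.' does not match newline).
(defvar my-dfa-multiline
  '(:start 0 :accept (2)
    :delta (((0 . ch) . 0) ((0 . nl) . 1) ((1 . ch) . 2) ((2 . ch) . 2))))

;; Hand-built DFA for a one-line pattern such as ".+".
(defvar my-dfa-one-line
  '(:start 0 :accept (1)
    :delta (((0 . ch) . 1) ((1 . ch) . 1))))

;; Hand-built DFA for a two-line pattern such as ".+\n.+".
(defvar my-dfa-two-lines
  '(:start 0 :accept (3)
    :delta (((0 . ch) . 1) ((1 . ch) . 1) ((1 . nl) . 2)
            ((2 . ch) . 3) ((3 . ch) . 3))))
```

Here `(my-dfa-intersection-empty-p my-dfa-one-line my-dfa-multiline '(ch nl))' reports an empty intersection, while the two-line DFA does not. Turning an arbitrary Emacs regexp into such a table is the hard part and is not attempted here.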
* Re: regexp does not work as documented
2008-05-11 18:44 ` Stefan Monnier
@ 2008-05-11 20:03 ` Thomas Lord
2008-05-12 1:43 ` Stefan Monnier
0 siblings, 1 reply; 30+ messages in thread
From: Thomas Lord @ 2008-05-11 20:03 UTC (permalink / raw)
To: Stefan Monnier
Cc: Chong Yidong, 192, emacs-devel, martin rudalics, David Koppelman, Bruno Haible
>> As for making hi-lock-mode detect whether or not a regexp is multi-line,
>> isn't that a computationally non-trivial problem?
>>
>
> Well, you can turn the regexp into a DFA, then take the ".*\n.+" regexp,
> turn it into another DFA, take the intersection of the two DFAs, and if
> it's empty you know your regexp can never match a multiline element.
>

If you are going to go to that trouble, perhaps there is a better
solution:

The Rx pattern matcher found in distributions of GNU Arch has these
relevant capabilities (relevant at least so far as I understand the
problem you are trying to solve):

1. It does on-the-fly regexp->DFA conversion, degrading gracefully
into mixed NFA/DFA mode or pure NFA mode if the DFA would be too
large. The calling program gets to say what "too large" is.

2. Although it is a C library, you can capture what is (in essence)
the continuation of an on-going match. That is, you can suspend a
match (or scan) part-way through, then later resume from that point,
perhaps multiple times. (This does not involve abusing the C stack.)

3. It does have some Unicode support in there and, though these
capabilities are under-tested and some features are missing, it is
quite flexible about encoding forms.

4. The DFA construction is "caching" and, for a given regexp, all uses
will share the DFA construction. E.g., multiple, suspended regexp
continuations can be space efficient because they will share state.

5. Because of the caching and structure sharing, you can tell if two
continuations from a single regexp have arrived at the same state with
a C EQ test ("==").

How can this help?

Well, instead of using heuristics to decide where to re-scan from and
to, you can cache a record of where the DFA scan arrived at for
periodic positions in the buffer. Then begin scanning from just
before any modification for as far as it takes to arrive at a DFA
state that is the same as last time, updating any highlighting in the
region between those two points.

I don't mean to imply that this is a trivial thing to implement in
Emacs but if you start getting up to building DFAs (very expensive in
the worst case) and taking intersections (very expensive in the worst
case) -- both also not all that simple to implement (nor obviously
possible for Emacs' extended regexp language) -- then the effort may
be comparable and (re-)visiting the option to adapt Rx to Emacs should
be worth considering.

As a point of amusement and credit where due, I think it was Jim
Blandy who first noticed this possibility in the early 1990s when I
was explaining to him the capabilities I was then just beginning to
add to Rx. This is a very old problem, long recognized, with some
work already done on a (purportedly) Right Thing solution.

-t
^ permalink raw reply [flat|nested] 30+ messages in thread
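The "rescan until the DFA state is the same as last time" idea from the message above can be illustrated with a toy. Everything here is an assumption made for the sketch: the "DFA" only tracks whether it is inside or outside a double-quoted string, the text is a Lisp string rather than a buffer, a state is cached for every position rather than for periodic ones, and the edit replaces characters without changing the length, so cached positions still line up. None of this is Rx's API.

```elisp
(defun my-toy-step (state ch)
  "Toy DFA: toggle between `out' and `in' on each double quote CH."
  (if (eq ch ?\")
      (if (eq state 'out) 'in 'out)
    state))

(defun my-scan-all (text)
  "Return a vector whose element I is the DFA state after reading TEXT[0..I]."
  (let ((states (make-vector (length text) nil))
        (state 'out))
    (dotimes (i (length text))
      (setq state (my-toy-step state (aref text i)))
      (aset states i state))
    states))

(defun my-rescan (text states from to)
  "Rescan TEXT after the characters at indices FROM..TO were replaced.
STATES is the cache computed for the old text; it is updated in place.
Scanning stops at the first index >= TO where the new state is `eq' to
the cached one, since everything after that point is unaffected.
Return the index where scanning stopped, or (length TEXT) if it ran
to the end."
  (let ((state (if (> from 0) (aref states (1- from)) 'out))
        (i from)
        (done nil))
    (while (and (< i (length text)) (not done))
      (setq state (my-toy-step state (aref text i)))
      (if (and (>= i to) (eq state (aref states i)))
          (setq done t)
        (aset states i state)
        (setq i (1+ i))))
    i))
```

Only the positions between FROM and the returned index would need their highlighting refreshed. An edit that unbalances the quotes makes the rescan run to the end of the text, which is the unavoidable worst case.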
* Re: regexp does not work as documented
2008-05-11 20:03 ` Thomas Lord
@ 2008-05-12 1:43 ` Stefan Monnier
2008-05-12 3:30 ` Thomas Lord
0 siblings, 1 reply; 30+ messages in thread
From: Stefan Monnier @ 2008-05-12 1:43 UTC (permalink / raw)
To: Thomas Lord
Cc: Chong Yidong, 192, emacs-devel, martin rudalics, David Koppelman, Bruno Haible
> Well, instead of using heuristics to decide where to re-scan from and
> to, you can cache a record of where the DFA scan arrived at for
> periodic positions in the buffer. Then begin scanning from just
> before any modification for as far as it takes to arrive at a DFA
> state that is the same as last time, updating any highlighting in the
> region between those two points.

That's a very good point. I'm not sure it's worth the trouble to store
it at various buffer positions and check if it's EQ to stop the rescan,
but at least we could match multiline expressions one line at a time.

In any case, it's indeed a non-trivial amount of work because it
probably requires rewriting not just font-lock but all the
foo-mode-font-lock-keywords as well (font-lock-keywords are order
dependent so you can't apply rule 3 after rule 4).

> I don't mean to imply that this is a trivial thing to implement in
> Emacs but if you start getting up to building DFAs (very expensive in
> the worst case) and taking intersections (very expensive in the worst
> case) -- both also not all that simple to implement (nor obviously
> possible for Emacs' extended regexp language) -- then the effort may
> be comparable and (re-)visiting the option to adapt Rx to Emacs should
> be worth considering.

I have most of the DFA construction code written, but I may take you up
on that anyway. BTW, regarding the "very expensive in the worst case",
how common is this worst case in real life?

Stefan
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: regexp does not work as documented
2008-05-12 1:43 ` Stefan Monnier
@ 2008-05-12 3:30 ` Thomas Lord
2008-05-12 13:43 ` Stefan Monnier
0 siblings, 1 reply; 30+ messages in thread
From: Thomas Lord @ 2008-05-12 3:30 UTC (permalink / raw)
To: Stefan Monnier
Cc: Chong Yidong, 192, emacs-devel, martin rudalics, David Koppelman, Bruno Haible
Stefan Monnier wrote:
> I have most of the DFA construction code written, but I may take you up
> on that anyway. BTW, regarding the "very expensive in the worst case",
> how common is this worst case in real life?
>

Talking about general purpose use of an engine, as opposed to
something more narrow like "likely font-lock expressions":

It's common enough in "real life" (in my experience) to be exactly the
minimal amount required to make it annoying.

Let me give you some rules of thumb to describe what I mean but I'll
refrain from a long treatise explaining the theory that supports
these. If you have questions I can "unpack" this on list or off.

First, be careful to make a clear distinction between (let's dub the
distinction) "regexps vs. regular expressions". "Regular expressions"
correspond to formally regular languages. Regular expressions can
always be converted to a DFA. "regexps" are what most people use in
most situations. regexps include things like features to extract the
positions that match a sub-pattern. regexps do *not* correspond to
regular languages and *can not* always be converted to DFA. DFA
conversion can help with regexp matching, but it can't solve it
completely. Moreover, you can make pathological regexps (very slow to
match) for every regexp engine I know of. Henry Spencer (I've heard)
asserts that regexps are NP-complete or something around there, though
I haven't seen the proof.

(Rx is a hybrid regular expression and regexp engine based on an
on-line (caching, incremental) DFA conversion suggested in the
"Dragon" compiler book and originating from (as I recall) an
experiment by Thompson and one of the other Bell Labs guys. One of
whoever it was privately mentioned that they abandoned it themselves
because it was taking too much code to implement, or something like
that.)

That aside, regular expressions are probably plenty for font-lock
functionality, so let's just talk about those. On-line DFA conversion
for those is likely practical by multiple means.

The size of a DFA is, worst-case, exponential in the size of the
regular expression. This is the source of pathological cases that can
thwart any DFA engine. (In exponential cases, Rx keeps running but
more like an NFA, taking a corresponding performance hit.)

For a time, I was pushing hard on Rx, trying to use it for absolutely
everything. To make up new problems to solve as in "Ok, let's suppose
I can run X-Kbyte long regular expressions with lots of | operators,
stars, etc. What can I apply that to?" One example is lexing for
real-world computer languages. [Aside: you can also use the same
engine as the core of a shift-reduce parser, of course -- and I did
that a long time ago with Guile, to useful effect.]

Almost always, the expressions were such that Rx would have no problem
and give very pleasing results. However, there were definitely times
(for huge regular expressions and smaller ones) when things would just
absolutely crawl. They happen "just often enough" to be an annoyance.

Now, every case of that annoyance that I found (in real-life
applications) had a solution. I could think for a while on *why* the
regular expression I wrote was blowing up and then think of a
different approach that eliminated the problem. Most often that meant
not a different but equivalent regular expression -- it meant going
one level up and changing the way the caller used regular expressions.
The caller would retain the same functionality but different demands
would be made on the regular expression (or regexp) engine.

And that makes it tricky to drop *blindly* into Emacs. To solve the
unavoidable annoying cases you really had to know how regular
expressions worked at a deep level. To diagnose and work around the
pathological cases took some expertise.

As a general rule, for something "general purpose" like a libc
implementation or the default regexp functions in Emacs lisp, there is
something to be said for using NFA matchers. It's perverse why:

Vast swaths of things that a DFA matcher can do very quickly an NFA
matcher can not. It's reliably slow. Therefore, people not prepared
to think hard about regexps tend to use an NFA matcher in pretty
limited ways. A powerful regular expression engine could simplify
their code, at the cost of risking finding a "hard case" -- but
there's no issue since people quickly give up and don't try to push
the match engine that hard.

More concisely:

If you use a DFA matcher You Better Know What You're Doing but also
Most of the Time You Won't Need To So It's Very Convenient.

If you use an NFA matcher You Can Get Away With Not Really Knowing
What You're Doing but also You Won't Be Trying Anything Fancy, Perhaps
Losing Out.

One approach to toss into the mix, this one suggested by Per Bothner
years ago, is to consider *offline* DFA conversion (a la 'lex(1)').
The advantage of offline (batch) conversion is that you can burn a lot
of cycles on DFA minimization and, if your offline converter
terminates, you've got a reliably linear matcher. The disadvantages
for *many* uses of regular expressions in Emacs should be pretty
obvious. For something like font-lock, where the regular expressions
don't change that often, that might be a good approach -- precompile
a minimal DFA and then add support for "regular expression
continuations" when using those tables.

-t
^ permalink raw reply [flat|nested] 30+ messages in thread
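To make the "exponential in the size of the regular expression" remark above concrete, here is a small sketch using a textbook example, not anything taken from Rx: the regular expression (a|b)*a(a|b)^(N-1), i.e. "the Nth symbol from the end is an a". Its NFA has N+1 states, but the subset construction reaches 2^N distinct state sets. The function name is hypothetical; NFA states 0..N are encoded as bits of an integer.

```elisp
(defun my-count-dfa-states (n)
  "Count the subsets reached by subset construction for (a|b)*a(a|b)^(N-1).
NFA state 0 loops on every symbol and also moves to state 1 on `a';
state K (1 <= K < N) moves to K+1 on every symbol; state N accepts."
  (let ((keep (1- (ash 1 n)))       ; bits 0..N-1: states with outgoing edges
        (seen (make-hash-table))
        (queue (list 1))            ; start subset {0}
        (count 0))
    (puthash 1 t seen)
    (while queue
      (let ((mask (pop queue)))
        (setq count (1+ count))
        (dolist (sym '(a b))
          (let ((next (logior
                       1                      ; state 0 is always kept
                       (if (eq sym 'a) 2 0)   ; 0 --a--> 1
                       ;; K --any--> K+1 for the states 1..N-1 in MASK
                       (ash (logand mask keep (lognot 1)) 1))))
            (unless (gethash next seen)
              (puthash next t seen)
              (push next queue))))))
    count))
```

For N = 10 the regular expression is a few dozen characters long yet the construction visits 1024 subsets, which is the kind of blow-up an online converter has to cap and an offline one has to sit through.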
* Re: regexp does not work as documented
2008-05-12 3:30 ` Thomas Lord
@ 2008-05-12 13:43 ` Stefan Monnier
2008-05-12 15:55 ` Thomas Lord
0 siblings, 1 reply; 30+ messages in thread
From: Stefan Monnier @ 2008-05-12 13:43 UTC (permalink / raw)
To: Thomas Lord
Cc: Chong Yidong, 192, emacs-devel, martin rudalics, David Koppelman, Bruno Haible
> years ago, is to consider *offline* DFA conversion (a la 'lex(1)').

That's what I do in lex.el.

> The advantage of offline (batch) conversion is that you can burn a lot
> of cycles on DFA minimization and, if your offline converter
> terminates, you've got a reliably linear matcher. The disadvantages
> for *many* uses of regular expressions in Emacs should be pretty
> obvious. For something like font-lock, where the regular expressions
> don't change that often, that might be a good approach -- precompile
> a minimal DFA and then add support for "regular expression
> continuations" when using those tables.

I do not intend to replace src/regex.c with a matcher based on offline
DFA conversion. Actually, the need to support backrefs makes it pretty
much impossible (tho I'm sure there's a way to adapt an offline DFA so
it can be used with backrefs), and most importantly its performance
characteristics are too different. More specifically, the compilation
step should be made explicit.

In any case I think you did answer my question: an offline DFA matcher
is fine, the worst case is not that common and can be worked around.
This is not that different from the current backtracking matcher.

Stefan

PS: The original motivation for a DFA-matcher is to extend
syntax-tables so they can match multi-char elements.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: regexp does not work as documented
2008-05-12 13:43 ` Stefan Monnier
@ 2008-05-12 15:55 ` Thomas Lord
2008-05-12 16:18 ` tomas
0 siblings, 1 reply; 30+ messages in thread
From: Thomas Lord @ 2008-05-12 15:55 UTC (permalink / raw)
To: Stefan Monnier
Cc: Chong Yidong, 192, emacs-devel, martin rudalics, David Koppelman, Bruno Haible
Stefan Monnier wrote:
> That's what I do in lex.el.
>
>

Sounds nice.

Last bits of experience report, then:

If it isn't so already, it may be easy to make it so that a choice of
which DFA is being used, plus a choice of the "current state" can be
represented as lisp objects and cheaply copied. That gives the
essence of "regular expression continuations".

Handy features that shouldn't be difficult to add (if not present):

Let programmers specify "labels" for each NFA state and then, for each
DFA state, have either a list of all NFA labels that correspond to
that DFA state and/or a more general way to "combine" NFA state labels
to make the DFA label. You can wind up with many NFA states combined
to a single DFA state, of course, so a "combine" function might be
important.

Include scanning functions to:

~ advance the DFA at most N characters (or until failure)

~ advance the DFA to the next non-nil state label (or failure)

In both cases, give a way for lisp programs to get back not only the
label (or failure indication) but also the regular expression
continuation.

Those features are handy so that (for example) lisp programs can hang
a suspended regexp continuation on a buffer character as a property,
doing incremental "re-lexing" in application-specific ways. The
"advance to non-nil label" feature is useful for writing lisp programs
that *do not* need back-referencing or sub-exp locations per se.

It is a bit more speculative but also consider functions to:

~ advance the state of a DFA based on characters provided in a
function call rather than read from a buffer -- e.g., a buffer
position should not have to be part of the state of a running DFA.

(advance-dfa re-continuation chr) => re-continuation

Why that last one? Because then you can probably use the same
DFA engine as the heart of a shift-reduce parser and (for languages
that admit such things) write an incremental parser. (You'd be using
non-buffer-position DFAs to process token ids emitted by the lexer.)
You can also use such a feature for things like serial I/O protocols.

Incremental parsers open the door to robust "syntax directed editing"
which I think could be an exciting direction for IDE features to take.
(Years ago, Thomas Reps and Tim Teitelbaum worked on the "Synthesizer
Generator" which I recall had features along these lines (their parser
guts were probably different from what I suggest). As I (now vaguely)
recall there is a book that talks about their Emacs-based
implementation.)

Bye. Thanks. And good luck!

-t
^ permalink raw reply [flat|nested] 30+ messages in thread
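One possible shape for the `(advance-dfa re-continuation chr) => re-continuation' interface suggested above, as a rough sketch. The representation is an assumption made for illustration: a continuation is a cons of a DFA table and a current state, the DFA is a hand-written plist, and every `my-' name is hypothetical. It is not lex.el's or Rx's actual API.

```elisp
;; A continuation is a cons (DFA . STATE); STATE nil means the match failed.
;; DFA is a plist (:start S :delta ALIST :labels ALIST), where :delta maps
;; (STATE . CHAR) to the next state and :labels maps STATE to a label.

(defun my-dfa-start (dfa)
  "Return a fresh continuation positioned at DFA's start state."
  (cons dfa (plist-get dfa :start)))

(defun my-advance-dfa (cont chr)
  "Return a new continuation: CONT advanced by the single character CHR.
CONT itself is not modified, so it can be stored and resumed later."
  (let ((dfa (car cont)))
    (cons dfa
          (and (cdr cont)
               (cdr (assoc (cons (cdr cont) chr)
                           (plist-get dfa :delta)))))))

(defun my-dfa-label (cont)
  "Return the label of CONT's current state, or nil if it has none."
  (cdr (assq (cdr cont) (plist-get (car cont) :labels))))

(defun my-advance-to-label (cont chars)
  "Feed the list CHARS to CONT until a labelled state or failure.
At least one character is consumed if any is available.
Return (CONTINUATION . REMAINING-CHARS)."
  (let ((labelled nil))
    (while (and chars (cdr cont) (not labelled))
      (setq cont (my-advance-dfa cont (car chars))
            chars (cdr chars)
            labelled (my-dfa-label cont)))
    (cons cont chars)))

;; Toy DFA that labels the state reached after reading "ab".
(defvar my-toy-dfa
  '(:start s0
    :delta (((s0 . ?a) . s1) ((s1 . ?b) . s2))
    :labels ((s2 . saw-ab))))
```

Because a continuation is just a small Lisp object and the table is shared, it can be copied cheaply or attached to buffer text as a property, and the characters can come from any source, not only a buffer.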
* Re: regexp does not work as documented
2008-05-12 15:55 ` Thomas Lord
@ 2008-05-12 16:18 ` tomas
0 siblings, 0 replies; 30+ messages in thread
From: tomas @ 2008-05-12 16:18 UTC (permalink / raw)
To: Thomas Lord
Cc: Chong Yidong, 192, emacs-devel, martin rudalics, David Koppelman, Stefan Monnier, Bruno Haible
On Mon, May 12, 2008 at 08:55:14AM -0700, Thomas Lord wrote:

[...]

> Why that last one? Because then you can probably use the same
> DFA engine as the heart of a shift-reduce parser and (for languages
> that admit such things) write an incremental parser. (You'd be using
> non-buffer-position DFAs to process token ids emitted by the lexer.)
> You can also use such a feature for things like serial I/O protocols.

...and cool things could be done with process-filter-function and its
cousin after-insert-file-functions (i.e. parse input on-the-fly).

Nifty stuff.

Regards
-- tomás
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: regexp does not work as documented
2008-05-06 11:35 ` Bruno Haible
2008-05-06 12:12 ` martin rudalics
@ 2008-05-06 15:35 ` Stefan Monnier
2008-05-06 21:29 ` Bruno Haible
2008-05-10 20:04 ` Bruno Haible
1 sibling, 2 replies; 30+ messages in thread
From: Stefan Monnier @ 2008-05-06 15:35 UTC (permalink / raw)
To: Bruno Haible; +Cc: Chong Yidong, 192, koppel, emacs-devel
> You are right that there is a limit, but it is set to 200000:
> highlight-regexp is aliased to hi-lock-face-buffer, which asks for the
> arguments and calls hi-lock-set-pattern. hi-lock-set-pattern does little
> more than applying a margin of 100000 and calling re-search-forward.

Actually, font-lock-fontified is most likely set to t, so
hi-lock-set-pattern doesn't call re-search-forward at all and only calls
font-lock-fontify-buffer instead.

> - I originally observed the bug in po-mode (part of GNU gettext), in
> a function po-find-span-of-entry which essentially only calls
> re-search-backward and re-search-forward.

Please try and reproduce the problem there and send us a recipe.

Stefan
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: regexp does not work as documented
2008-05-06 15:35 ` Stefan Monnier
@ 2008-05-06 21:29 ` Bruno Haible
2008-05-10 20:04 ` Bruno Haible
1 sibling, 0 replies; 30+ messages in thread
From: Bruno Haible @ 2008-05-06 21:29 UTC (permalink / raw)
To: Stefan Monnier; +Cc: Chong Yidong, 192, koppel, emacs-devel
Stefan Monnier wrote:
> Actually, font-lock-fontified is most likely set to t, so
> hi-lock-set-pattern doesn't call re-search-forward at all and only calls
> font-lock-fontify-buffer instead.

You're right: If I do a
M-x eval-expression (setq font-lock-fontified nil)
before
M-x highlight-regexp
the result is correct. So the problem is indeed with the font-locking.

Bruno
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: regexp does not work as documented
2008-05-06 15:35 ` Stefan Monnier
2008-05-06 21:29 ` Bruno Haible
@ 2008-05-10 20:04 ` Bruno Haible
1 sibling, 0 replies; 30+ messages in thread
From: Bruno Haible @ 2008-05-10 20:04 UTC (permalink / raw)
To: Stefan Monnier; +Cc: Chong Yidong, 192, koppel, emacs-devel
Stefan Monnier wrote:
> > - I originally observed the bug in po-mode (part of GNU gettext), in
> > a function po-find-span-of-entry which essentially only calls
> > re-search-backward and re-search-forward.
>
> Please try and reproduce the problem there and send us a recipe.

After deeper investigation, the bug in po-mode was not directly related:
I had used a regexp which was not designed for use with
re-search-backward. This bug is closed now.
<https://savannah.gnu.org/bugs/?23177>

Thank you Stefan for bringing me on the right track.

Bruno
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: regexp does not work as documented
2008-05-06 4:20 regexp does not work as documented Chong Yidong
2008-05-06 11:35 ` Bruno Haible
@ 2008-05-06 15:00 ` David Koppelman
2008-05-06 21:35 ` Bruno Haible
1 sibling, 1 reply; 30+ messages in thread
From: David Koppelman @ 2008-05-06 15:00 UTC (permalink / raw)
To: Chong Yidong; +Cc: 192, Bruno Haible, emacs-devel
Later in the week I'll look into it and provide either a fix
or document the limitation.

Chong Yidong <cyd@stupidchicken.com> writes:
> OTOH, I don't see what we can do about this problem. Maybe we could add
> a note to the docstring of highlight-regexp saying that multi-line
> regular expressions are problematic? Does anyone have a suggestion?

Chong Yidong <cyd@stupidchicken.com> writes:
>> $ emacs fr.po
>> M-x highlight-regexp
>>
>> Regexp to highlight (enter this with Ctrl-Q Ctrl-J for each of the two
>> newlines):
>>
>> ^msgstr\(\[[0-9]\]\)?.*
>> \(".*
>> \)*
>>
>> Highlight using face: hi-yellow
>>
>> Then move around in the buffer and look which lines are highlighted.
>> In the first match already, only 5 out of 11 lines are highlighted.
>
> I believe this bug arises because highlight-regexp uses font-lock to
> highlight the regular expression, and the font-lock engine is
> intentionally limiting the region to search for the multi-line regular
> expression.
>
> OTOH, I don't see what we can do about this problem. Maybe we could add
> a note to the docstring of highlight-regexp saying that multi-line
> regular expressions are problematic? Does anyone have a suggestion?
>
>
> BTW, here is a simplified recipe, for those who didn't download the
> attached file:
>
> 1. Copy the following text, between the "---...----" lines, into a
> buffer
>
> ------------------
> # Messages français pour GNU gettext.
> # Copyright © 2006 Free Software Foundation, Inc.
> # François Pinard <pinard@iro.umontreal.ca>, 1996.
> #
> #
> msgid ""
> msgstr ""
> "Project-Id-Version: GNU gettext-tools 0.16.2-pre5\n"
> "Report-Msgid-Bugs-To: bug-gnu-gettext@gnu.org\n"
> "POT-Creation-Date: 2007-11-02 03:23+0100\n"
> "PO-Revision-Date: 2007-10-27 13:35+0200\n"
> "Last-Translator: Christophe Combelles <ccomb@free.fr>\n"
> "Language-Team: French <traduc@traduc.org>\n"
> "MIME-Version: 1.0\n"
> "Content-Type: text/plain; charset=UTF-8\n"
> "Content-Transfer-Encoding: 8bit\n"
> "Plural-Forms: nplurals=2; plural=(n > 1);\n"
> ------------------
>
> 2. M-: (highlight-regexp "^m.*\n\\(\".*\n\\)+") RET
>
> Note that the last two lines remain unhighlighted.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: regexp does not work as documented
2008-05-06 15:00 ` David Koppelman
@ 2008-05-06 21:35 ` Bruno Haible
2008-05-07 1:04 ` Stefan Monnier
2008-05-07 1:08 ` Auto-discovery of multi-line font-lock regexps Stefan Monnier
0 siblings, 2 replies; 30+ messages in thread
From: Bruno Haible @ 2008-05-06 21:35 UTC (permalink / raw)
To: David Koppelman, Chong Yidong; +Cc: 192, emacs-devel
David Koppelman wrote:
> Later in the week I'll look into it and provide either a fix
> or document the limitation.

Thank you!

Chong Yidong wrote:
> > OTOH, I don't see what we can do about this problem. Maybe we could add
> > a note to the docstring of highlight-regexp saying that multi-line
> > regular expressions are problematic? Does anyone have a suggestion?

As an end user, for testing the effect of a regexp on a buffer interactively,
I would prefer to have a "volatile" coloring (i.e. one that disappears at the
next buffer modification) but is correct, rather than a documented-to-be-wrong
coloring that updates itself correctly during buffer modifications. Less
functionality but implemented correctly.

OTOH, third-party packages may prefer the current behaviour if their regexps
match only portions of a line.

Bruno
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: regexp does not work as documented
2008-05-06 21:35 ` Bruno Haible
@ 2008-05-07 1:04 ` Stefan Monnier
2008-05-07 1:08 ` Auto-discovery of multi-line font-lock regexps Stefan Monnier
1 sibling, 0 replies; 30+ messages in thread
From: Stefan Monnier @ 2008-05-07 1:04 UTC (permalink / raw)
To: Bruno Haible; +Cc: David Koppelman, 192, Chong Yidong, emacs-devel
> As an end user, for testing the effect of a regexp on a buffer interactively,
> I would prefer to have a "volatile" coloring (i.e. one that disappears at the
> next buffer modification) but is correct, rather than a documented-to-be-wrong
> coloring that updates itself correctly during buffer modifications. Less
> functionality but implemented correctly.

Actually, we can get the combination of the two. highlight-regexp
(c|sh)ould use its own loop with re-search-forward, even when font-lock
is enabled. This way the highlighting would be initially correct, and
in some cases it would also be correctly preserved/discovered later on.

This would only be used for regexps that can span multiple lines, so
highlight-regexp (c|sh)ould analyse the regexp to see if there's
a possibility for it to match multiple lines.

Stefan
^ permalink raw reply [flat|nested] 30+ messages in thread
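The "own loop with re-search-forward" suggested above might look roughly like this. It is a sketch under assumptions: it uses overlays rather than text properties, the function name and the `my-highlight' overlay property are made up for the example, and it is not how hi-lock is actually implemented. Because the search runs over the whole buffer in one pass, a multi-line match such as the msgstr block from the original report is covered completely.

```elisp
(defun my-highlight-regexp-with-overlays (regexp face)
  "Put FACE on every match of REGEXP in the current buffer, using overlays.
Return the number of matches highlighted.  Hypothetical helper."
  (let ((count 0))
    (save-excursion
      (goto-char (point-min))
      (while (and (not (eobp))
                  (re-search-forward regexp nil t))
        (if (= (match-beginning 0) (match-end 0))
            ;; Empty match: step forward so the loop always makes progress.
            (unless (eobp) (forward-char 1))
          (let ((ov (make-overlay (match-beginning 0) (match-end 0))))
            (overlay-put ov 'face face)
            ;; Tag the overlay so it can be found and deleted later.
            (overlay-put ov 'my-highlight t)
            (setq count (1+ count))))))
    count))
```

The highlighting produced this way is correct when it is created but does not track later edits by itself, which matches the "volatile but correct" coloring Bruno asked for.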
* Auto-discovery of multi-line font-lock regexps
2008-05-06 21:35 ` Bruno Haible
2008-05-07 1:04 ` Stefan Monnier
@ 2008-05-07 1:08 ` Stefan Monnier
2008-05-07 3:46 ` Chong Yidong
1 sibling, 1 reply; 30+ messages in thread
From: Stefan Monnier @ 2008-05-07 1:08 UTC (permalink / raw)
To: emacs-devel
While reading the "regexp does not work as documented" thread, an idea
came to me: we could have an idle task that takes the font-lock regexps
and instead of applying them directly, only uses them to find matches
that span multiple lines and mark them with the
`font-lock-multiline' property.

The idea is that the font-lock-multiline works OK to preserve multiline
matches, so the real difficulty is in making sure we discover
them correctly. Sometimes we do by happenstance, sometimes we do
because the major-mode was careful to make it work (which is far from
trivial), but often we just don't. Having such a background loop would
be very helpful. Its job can easily be stopped at any time, so it
shouldn't introduce long latencies.

Stefan
^ permalink raw reply [flat|nested] 30+ messages in thread
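A minimal sketch of the scanning half of the idle task proposed above, assuming a single regexp is passed in by hand instead of being extracted from font-lock-keywords. The function name is hypothetical. It marks only matches that really cross a line boundary, and it gives up as soon as input is pending so it should not add noticeable latency; a real implementation would also have to remember where it stopped instead of restarting from the top.

```elisp
(defun my-mark-multiline-matches (regexp)
  "Mark matches of REGEXP that span several lines in the current buffer.
Each such match gets the `font-lock-multiline' text property.
Stop early if input arrives.  Return the number of matches marked."
  (let ((count 0))
    (save-excursion
      (goto-char (point-min))
      ;; Adding the property should not mark the buffer as modified.
      (with-silent-modifications
        (while (and (not (input-pending-p))
                    (not (eobp))
                    (re-search-forward regexp nil t))
          (let ((beg (match-beginning 0))
                (end (match-end 0)))
            (cond
             ;; Empty match: step forward so the loop makes progress.
             ((= beg end) (unless (eobp) (forward-char 1)))
             ;; A newline before the last matched character means the
             ;; match extends onto a following line.
             ((save-excursion
                (goto-char beg)
                (search-forward "\n" (1- end) t))
              (put-text-property beg end 'font-lock-multiline t)
              (setq count (1+ count))))))))
    count))

;; One way it might be scheduled (not evaluated here):
;; (run-with-idle-timer 2 t
;;   (lambda () (my-mark-multiline-matches "^m.*\n\\(\".*\n\\)+")))
```

Note that an idle timer runs in whatever buffer happens to be current, so a real version would need to select the intended buffers explicitly.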
* Re: Auto-discovery of multi-line font-lock regexps
2008-05-07 1:08 ` Auto-discovery of multi-line font-lock regexps Stefan Monnier
@ 2008-05-07 3:46 ` Chong Yidong
2008-05-07 4:21 ` Stefan Monnier
0 siblings, 1 reply; 30+ messages in thread
From: Chong Yidong @ 2008-05-07 3:46 UTC (permalink / raw)
To: Stefan Monnier; +Cc: emacs-devel
Stefan Monnier <monnier@iro.umontreal.ca> writes:
> While reading the "regexp does not work as documented" thread, an idea
> came to me: we could have an idle task that takes the font-lock regexps
> and instead of applying them directly, only uses them to find matches
> that span multiple lines and mark them with the
> `font-lock-multiline' property.
>
> The idea is that the font-lock-multiline works OK to preserve multiline
> matches, so the real difficulty is in making sure we discover
> them correctly. Sometimes we do by happenstance, sometimes we do
> because the major-mode was careful to make it work (which is far from
> trivial), but often we just don't. Having such a background loop would
> be very helpful. Its job can easily be stopped at any time, so it
> shouldn't introduce long latencies.

Sounds like a good idea, but wouldn't that run into the same problem
with the JIT lock stealth timer that necessitated setting
jit-lock-stealth-time to nil (i.e., people on laptops complaining about
Emacs eating CPU)?
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Auto-discovery of multi-line font-lock regexps
2008-05-07 3:46 ` Chong Yidong
@ 2008-05-07 4:21 ` Stefan Monnier
0 siblings, 0 replies; 30+ messages in thread
From: Stefan Monnier @ 2008-05-07 4:21 UTC (permalink / raw)
To: Chong Yidong; +Cc: emacs-devel
> Sounds like a good idea, but wouldn't that run into the same problem
> with the JIT lock stealth timer that necessitated setting
> jit-lock-stealth-time to nil (i.e., people on laptops complaining about
> Emacs eating CPU)?

Of course. Except it would actually make a difference w.r.t. behavior
rather than just performance. I expect it would only be enabled in some
particular buffers where it proves necessary. Maybe the problematic
regexps could be specially labelled in font-lock-keywords so that only
relevant regexps get this kind of treatment.

Stefan
^ permalink raw reply [flat|nested] 30+ messages in thread