* bug#192: regexp does not work as documented
From: David Koppelman @ 2008-05-10 19:18 UTC
To: martin rudalics; +Cc: Chong Yidong, 192, Bruno Haible, emacs-devel

I was able to reproduce the problem with Bruno Haible's testcase and
font-lock-multiline t does "fix" it. However martin rudalics warns that
font-lock-multiline won't work for all cases and provides an example
idea (below). I can't get that to fail. That is, with
font-lock-multiline t the text was correctly fontified (though after a
pause). My realization of the example was to remove and then add the
first quotation mark from one of the interior lines below (also
tried with more quoted lines):

msgstr ""
"Project-Id-Version: GNU gettext-tools 0.16.2-pre5\n"
"Report-Msgid-Bugs-To: bug-gnu-gettext@gnu.org\n"
"POT-Creation-Date: 2007-11-02 03:23+0100\n"
"PO-Revision-Date: 2007-10-27 13:35+0200\n"
"Last-Translator: Christophe Combelles <ccomb@free.fr>\n"
"Last-Translator: Christophe Combelles <ccomb@free.fr>\n"
x

The fix I'm contemplating would be to warn the user when a multi-line
regexp was added interactively and font-lock-multiline was nil, and then
perhaps to offer to set font-lock-multiline to t (or to not set it, or
to stop asking).

martin rudalics <rudalics@gmx.at> writes:

>> Can someone help me find a workaround, then? If not, I would have to give up
>> maintaining po-mode as part of GNU gettext. Said function is central in
>> Emacs po-mode (everything else relies on it), and if multi-line regular
>> expressions don't work, I don't know how this function could be rewritten.
>
> Don't worry, Stefan will find the solution. First of all you will
> probably have to
>
>   (setq font-lock-multiline t)
>
> in the respective buffer. This will _not_ always DTRT after a buffer
> modification, as, for example, in
>
>   AAAA
>
>   CCCC
>
>   BBBB
>
> where AAAA stands for some old text previously matched by your regexp,
> CCCC for some new text inserted (or old text removed), and BBBB for some
> text which, after the change, is now matched by the regexp (or not
> matched any more): In this case BBBB will be wrongly highlighted now.
> Alan uses the notorious
>
>   `font-lock-extend-jit-lock-region-after-change'
>
> function to handle this, but it's not immediately clear how to apply
> this here. If everything else fails you will have to refontify till
> `window-end' (I prefer using a timer for such refontifications).
* bug#192: regexp does not work as documented
From: David Koppelman @ 2008-05-10 20:13 UTC
To: martin rudalics; +Cc: Chong Yidong, 192, Bruno Haible, emacs-devel

I *am* able to reproduce the font-lock-multiline limitation martin
described if the buffer is in text mode. I had tried to reproduce it in
emacs-lisp mode.

First I'll work on the hi-lock warning as I described below, then I'll
see about detecting and doing something helpful for additional
situations where multi-line won't work.

David Koppelman <koppel@ece.lsu.edu> writes:

> I was able to reproduce the problem with Bruno Haible's testcase and
> font-lock-multiline t does "fix" it. However martin rudalics warns that
> font-lock-multiline won't work for all cases and provides an example
> idea (below). I can't get that to fail. That is, with
> font-lock-multiline t the text was correctly fontified (though after a
> pause). My realization of the example was to remove and then add the
> first quotation mark from one of the interior lines below (also
> tried with more quoted lines):
>
> msgstr ""
> "Project-Id-Version: GNU gettext-tools 0.16.2-pre5\n"
> "Report-Msgid-Bugs-To: bug-gnu-gettext@gnu.org\n"
> "POT-Creation-Date: 2007-11-02 03:23+0100\n"
> "PO-Revision-Date: 2007-10-27 13:35+0200\n"
> "Last-Translator: Christophe Combelles <ccomb@free.fr>\n"
> "Last-Translator: Christophe Combelles <ccomb@free.fr>\n"
> x
>
> The fix I'm contemplating would be to warn the user when a multi-line
> regexp was added interactively and font-lock-multiline was nil, and then
> perhaps to offer to set font-lock-multiline to t (or to not set it, or
> to stop asking).
* bug#192: regexp does not work as documented
From: martin rudalics @ 2008-05-11 7:40 UTC
To: David Koppelman; +Cc: Chong Yidong, 192, Bruno Haible, emacs-devel

> First I'll work on the hi-lock warning as I described below, then I'll
> see about detecting and doing something helpful for additional
> situations where multi-line won't work.

Think of the following pathological case: Devise a regexp to highlight
the first line of a buffer provided the buffer does not end with a
newline. Doing this with `font-lock-multiline' hardly makes any sense.

Maybe users should classify whether a regexp they use

(1) doesn't match newlines - no `font-lock-multiline' needed,

(2) matches at most n newlines, in which case you should tell font-lock
    to rescan from n lines before each buffer change (with large n the
    display engine will suffer noticeably, mainly because font-lock has
    to search for all other keywords as well), or

(3) may match more than n newlines, in which case you should use an idle
    timer to scan the entire buffer for any matches of such regexps and
    highlight them separately.
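
Case (2) above can be sketched in Emacs Lisp as follows. This is only an illustration, not code from the thread: it assumes the standard font-lock variable `font-lock-extend-after-change-region-function' is the hook one would use, and the names `my-multiline-max-lines' and `my-extend-after-change-region' are hypothetical. It widens the region in both directions, which is a design choice beyond what the message asks for (rescanning only from n lines before the change).

```elisp
;; Hypothetical names; `my-multiline-max-lines' plays the role of "n".
(defvar my-multiline-max-lines 5
  "Assumed upper bound on the newlines a highlighted regexp may match.")

(defun my-extend-after-change-region (beg end _old-len)
  "Return (BEG . END) widened by `my-multiline-max-lines' lines each way.
Meant as a value for `font-lock-extend-after-change-region-function'."
  (save-excursion
    (cons (progn (goto-char beg)
                 ;; Move to the start of the line N lines above the change.
                 (forward-line (- my-multiline-max-lines))
                 (point))
          (progn (goto-char end)
                 ;; Move to the start of the line N lines below the change.
                 (forward-line my-multiline-max-lines)
                 (point)))))

;; Install buffer-locally, for example from a mode hook.
(setq-local font-lock-extend-after-change-region-function
            #'my-extend-after-change-region)
```

As the message warns, a large value of n makes every edit refontify many lines with all keywords, so this only suits small n.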
* bug#192: regexp does not work as documented
From: Chong Yidong @ 2008-05-11 14:27 UTC
To: martin rudalics; +Cc: David Koppelman, 192, Bruno Haible, emacs-devel

martin rudalics <rudalics@gmx.at> writes:

>> First I'll work on the hi-lock warning as I described below, then I'll
>> see about detecting and doing something helpful for additional
>> situations where multi-line won't work.
>
> Think of the following pathological case: Devise a regexp to highlight
> the first line of a buffer provided the buffer does not end with a
> newline. Doing this with `font-lock-multiline' hardly makes any sense.

Ideally, highlight-regexp should work automagically, instead of forcing
users to do something extra to make their multi-line regexp work
properly. The right way to do this is probably for hi-lock-mode to
process the buffer initially, setting up text properties to make
font-lock DTRT even for multi-line expressions. But that's a big job.

As for making hi-lock-mode detect whether or not a regexp is multi-line,
isn't that a computationally non-trivial problem?

Maybe making hi-lock-mode turn on font-lock-multiline, while not
foolproof, works often enough to be satisfactory.
* bug#192: regexp does not work as documented
From: David Koppelman @ 2008-05-11 15:36 UTC
To: Chong Yidong; +Cc: 192, Bruno Haible, emacs-devel

I agree pretty much with everything Chong Yidong writes. I'd rather not
bother the user with an additional question if I don't have to; the
alternative would be a warning.

My latest plan is to do what Chong Yidong suggests, setting up text
properties so that font-lock DTRT, though it doesn't seem as hard as he
suggests (I'm still in the naive enthusiasm stage).

I tried adding the font-lock-multiline property to the face property
list passed to font lock and that did the trick, even with the
font-lock-multiline variable nil. I'd rather do that than turn on
font-lock-multiline because I'm assuming that font-lock-multiline is set
to nil in most cases for a good reason.

Rather than perfectly distinguishing multi-line from single-line
patterns, guessing would be good enough for hi-lock. I'm using the
following regexp,

  "\\(\n.\\|\\\\W[*+]\\|\\\\[SC].[*+]\\|\\[\\^[^]]+\\][+*]\\)",

which hopefully isn't too far from covering a large majority of
interactively entered patterns.

I actually thought about properly parsing the regexp, but the effort to
do that could be spent on making multi-line patterns work properly, at
least if they don't span too many lines.

One more thing: multi-line regexp matches don't work properly even
with font-lock-multiline t when jit-lock is being used in a buffer
without syntactic fontification and using the default setting of
jit-lock-contextually; setting it to t gets multi-line fontification
to work.

I plan to play around a bit more and come up with something, maybe
today, maybe early this week.

Chong Yidong <cyd@stupidchicken.com> writes:

> Ideally, highlight-regexp should work automagically, instead of forcing
> users to do something extra to make their multi-line regexp work
> properly. The right way to do this is probably for hi-lock-mode to
> process the buffer initially, setting up text properties to make
> font-lock DTRT even for multi-line expressions. But that's a big job.
>
> As for making hi-lock-mode detect whether or not a regexp is multi-line,
> isn't that a computationally non-trivial problem?
>
> Maybe making hi-lock-mode turn on font-lock-multiline, while not
> foolproof, works often enough to be satisfactory.
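
The guessing heuristic in the message above can be tried out with a few lines of Emacs Lisp. This is a minimal sketch: the regexp is copied verbatim from the message, while the names `my-hi-lock-multiline-guess-re' and `my-hi-lock-maybe-multiline-p' are hypothetical and not part of hi-lock.

```elisp
;; Hypothetical names; only the regexp itself comes from the thread.
(defconst my-hi-lock-multiline-guess-re
  "\\(\n.\\|\\\\W[*+]\\|\\\\[SC].[*+]\\|\\[\\^[^]]+\\][+*]\\)"
  "Heuristic pattern matching regexps that look like they may span lines.
It looks for a literal newline followed by a character, a repeated \\W,
a repeated \\S or \\C class, or a repeated negated character set.")

(defun my-hi-lock-maybe-multiline-p (regexp)
  "Return t if REGEXP looks like it could match across a newline.
This is a guess based on the text of REGEXP, not a proof either way."
  (and (string-match-p my-hi-lock-multiline-guess-re regexp) t))

;; A repeated negated set can swallow newlines, so it is flagged:
(my-hi-lock-maybe-multiline-p "foo[^\"]+bar")   ; => t
;; A repeated ordinary set is not flagged:
(my-hi-lock-maybe-multiline-p "foo[a-z]+bar")   ; => nil
```

Because it inspects only the pattern text, the check can produce both false positives and false negatives, which matches the "guessing would be good enough" intent stated in the message.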
* bug#192: regexp does not work as documented
From: Stefan Monnier @ 2008-05-11 18:44 UTC
To: David Koppelman; +Cc: Chong Yidong, 192, Bruno Haible, emacs-devel

> My latest plan is to do what Chong Yidong suggests, setting up text
> properties so that font-lock DTRT, though it doesn't seem as hard as he
> suggests (I'm still in the naive enthusiasm stage).

Indeed, it shouldn't be that hard.

> I tried adding the font-lock-multiline property to the face property
> list passed to font lock and that did the trick, even with the
> font-lock-multiline variable nil.

That may not be enough. You'll probably want to do something like what
smerge does:

   (while (re-search-forward <RE> nil t)
     (font-lock-fontify-region (match-beginning 0) (match-end 0)))

this will find all the multiline elements. And the font-lock-multiline
property you add will make sure that those that were found will not
disappear accidentally because of some later refontification.

> I rather do that than turn on font-lock-multiline because I'm assuming
> that font-lock-multiline is set to nil in most cases for
> a good reason.

Setting the `font-lock-multiline' variable to t has a performance cost.

> I actually thought about properly parsing the regexp, but the effort to
> do that could be spent on making multi-line patterns work properly, at
> least if they don't span too many lines.

If someone wants that, I have a parser that takes a regexp and turns it
into something like `rx' syntax. It uses my lex.el library (which
takes an `rx'-like input syntax).

> One more thing, multi-line regexp matches don't work properly even
> with font-lock-multiline t when jit-lock is being used in a buffer
> without syntactic fontification and using the default setting of
> jit-lock-contextually, setting it to t gets multi-line fontification
> to work.

The `font-lock-multiline' variable only tells font-lock that if it ever
bumps into a multiline element, it should mark it (with the
font-lock-multiline property) so that it will re-fontify it as
a whole if it ever needs to refontify it.

So it doesn't solve the problem of "how do I make sure that font-lock
indeed finds the multiline element". Multiline elements can only be
found when font-locking a large enough piece of text, which tends to
only happen during the initial fontification, or during background or
contextual refontification, or during an explicit call such as in the
above while loop.


        Stefan
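
The explicit refontification loop from the message above can be wrapped into a self-contained function along these lines. This is a hedged sketch, not the smerge-mode code: the function name `my-fontify-multiline-matches' is hypothetical, and the handling of empty matches and the explicit `goto-char' after fontifying are defensive additions that are not in the original loop.

```elisp
;; Hypothetical helper built around the loop quoted in the thread.
(defun my-fontify-multiline-matches (regexp)
  "Refontify every match of REGEXP in the current buffer.
Each match is handed to `font-lock-fontify-region' as one region, so
font-lock sees the whole multi-line element at once."
  (save-excursion
    (goto-char (point-min))
    (while (and (not (eobp))
                (re-search-forward regexp nil t))
      ;; Capture the match bounds first: fontification runs other
      ;; keyword searches and may change the match data.
      (let ((beg (match-beginning 0))
            (end (match-end 0)))
        (if (= beg end)
            ;; Empty match: step forward so the loop cannot spin in place.
            (unless (eobp) (forward-char 1))
          (font-lock-fontify-region beg end)
          ;; Resume the search right after the match just handled.
          (goto-char end))))))
```

Calling `(my-fontify-multiline-matches "...")` in a buffer with font-lock enabled applies all active keywords to each matched region, which is the cost raised later in the thread.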
* bug#192: regexp does not work as documented
From: David Koppelman @ 2008-05-11 19:09 UTC
To: Stefan Monnier; +Cc: Chong Yidong, 192, Bruno Haible, emacs-devel

I've decided against having hi-lock turn on font-lock-multiline or even
apply font-lock-multiline text properties: there is too much potential
to slow things down to a crawl when an unsuspecting user enters a
regexp.

If I understand things correctly, the font-lock-multiline property is
used to extend a region to be fontified, a region to be used for *all*
keywords. This would have disastrous effects when multi-line patterns
span, say, hundreds of lines for modes with hundreds of keywords. I had
been toying with the idea of limiting extended regions to something like
100 lines, but that still seems wasteful when most keywords are single
line (I haven't benchmarked anything yet).

A better solution would be to have font-lock use multi-line extended
regions selectively. Perhaps a hint in the current keyword syntax
(say, explicitly applying the font-lock-multiline property), or a
separate method for providing multi-line keywords to font-lock.
Such keywords would get the multi-line extended regions, the others
just the whole-line extensions (or whatever the hooks do).

Is this something the font-lock maintainers would consider?

What I'll do now is just document the limitations for hi-lock and
perhaps provide a warning when a multiline pattern is used.

> That may not be enough. You'll probably want to do something like what
> smerge does:
>
>    (while (re-search-forward <RE> nil t)
>      (font-lock-fontify-region (match-beginning 0) (match-end 0)))

I wouldn't do that without suppressing other keywords.

> If someone wants that, I have a parser that takes a regexp and turns it
> into something like `rx' syntax. It uses my lex.el library (which
> takes an `rx'-like input syntax).

That sounds useful, either E-mail it to me or let me know
where to find it.

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> My latest plan is to do what Chong Yidong suggests, setting up text
>> properties so that font-lock DTRT, though it doesn't seem as hard as he
>> suggests (I'm still in the naive enthusiasm stage).
>
> Indeed, it shouldn't be that hard.
>
>> I tried adding the font-lock-multiline property to the face property
>> list passed to font lock and that did the trick, even with the
>> font-lock-multiline variable nil.
>
> That may not be enough. You'll probably want to do something like what
> smerge does:
>
>    (while (re-search-forward <RE> nil t)
>      (font-lock-fontify-region (match-beginning 0) (match-end 0)))
>
> this will find all the multiline elements. And the font-lock-multiline
> property you add will make sure that those that were found will not
> disappear accidentally because of some later refontification.
>
>> I rather do that than turn on font-lock-multiline because I'm assuming
>> that font-lock-multiline is set to nil in most cases for
>> a good reason.
>
> Setting the `font-lock-multiline' variable to t has a performance cost.
>
>> I actually thought about properly parsing the regexp, but the effort to
>> do that could be spent on making multi-line patterns work properly, at
>> least if they don't span too many lines.
>
> If someone wants that, I have a parser that takes a regexp and turns it
> into something like `rx' syntax. It uses my lex.el library (which
> takes an `rx'-like input syntax).
>
>> One more thing, multi-line regexp matches don't work properly even
>> with font-lock-multiline t when jit-lock is being used in a buffer
>> without syntactic fontification and using the default setting of
>> jit-lock-contextually, setting it to t gets multi-line fontification
>> to work.
>
> The `font-lock-multiline' variable only tells font-lock that if it ever
> bumps into a multiline element, it should mark it (with the
> font-lock-multiline property) so that it will re-fontify it as
> a whole if it ever needs to refontify it.
>
> So it doesn't solve the problem of "how do I make sure that font-lock
> indeed finds the multiline element". Multiline elements can only be
> found when font-locking a large enough piece of text, which tends to
> only happen during the initial fontification, or during background or
> contextual refontification, or during an explicit call such as in the
> above while loop.
>
>
>         Stefan
* bug#192: regexp does not work as documented
From: Stefan Monnier @ 2008-05-12 1:28 UTC
To: David Koppelman; +Cc: Chong Yidong, 192, Bruno Haible, emacs-devel

> A better solution would be to have font-lock use multi-line extended
> regions selectively. Perhaps a hint in the current keyword syntax
> (say, explicitly applying the font-lock-multiline property), or a
> separate method for providing multi-line keywords to font-lock.

I don't understand the difference between the above and the application
of font-lock-multiline properties which you seem to have tried and rejected.

I don't necessarily disagree with your rejection of font-lock-multiline:
it can have disastrous effect indeed if the multiline region becomes large.

> Such keywords would get the multi-line extended regions, the other
> just the whole-line extensions (or whatever the hooks do).
> Is this something the font-lock maintainers would consider?

I guess I simply do not understand what you propose. Any improvement in
the multiline handling is welcome, but beware: this is not an easy area.

>> (while (re-search-forward <RE> nil t)
>>   (font-lock-fontify-region (match-beginning 0) (match-end 0)))

> I wouldn't do that without suppressing other keywords.

FWIW, I do pretty much exactly the above loop in smerge-mode and
I haven't heard complaints yet.

>> If someone wants that, I have a parser that takes a regexp and turns it
>> into something like `rx' syntax. It uses my lex.el library (which
>> takes an `rx'-like input syntax).

> That sounds useful, either E-mail it to me or let me know
> where to find it.

Find the current version attached. Consider it as 99.9% untested code,
tho. Also you need to eval it before you can byte-compile it.
And I strongly recommend you byte-compile it to reduce the
specpdl usage.


        Stefan

[-- Attachment #2: lex.el --]
[-- Type: application/emacs-lisp, Size: 49609 bytes --]
* bug#192: regexp does not work as documented
From: David Koppelman @ 2008-05-12 15:03 UTC
To: Stefan Monnier; +Cc: Chong Yidong, 192, Bruno Haible, emacs-devel

> I guess I simply do not understand what you propose. Any improvement in
> the multiline handling is welcome, but beware: this is not an easy area.

I'm proposing that font-lock divide keywords into two or three
classes: ordinary, multi-line, and maybe mega-line. Matches for
multi-line and mega-line keywords would be over much larger
regions. Here is how it might work with two classes (keep in mind that
I don't yet have a thorough understanding of font-lock and jit-lock):

Multi-line keywords are explicitly identified as such, perhaps through
keyword syntax or the way they are given to font-lock (say, using
font-lock-multiline-keywords). Explicit identification avoids
performance problems from keywords that, though technically
multi-line, rarely span more than a few lines.

Functions such as font-lock-default-fontify-region would find two sets
of extended regions, ordinary and multi, running functions on two
hooks for this purpose. The multi-line hook might extend the region
based on the size of the largest supported match rather than using the
multiline property. The multiline property might still be useful for
non-deferred handling of existing matches.

Functions such as font-lock-fontify-keywords-region would be passed
both extended regions and use the region appropriate for each keyword
they process. The large region is only used on the few multi-line
patterns that need it.

Here I'm assuming that a mode might have hundreds of single-line (or
two-line) keywords and only a few multi-line keywords, and the
multi-line keywords might span no more than hundreds of lines. We
could guarantee that matches for such patterns are perfect (using a
line-count-limit variable).

If there were a third class, mega-line, it would have its own text
property and region-extension hook.

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> A better solution would be to have font-lock use multi-line extended
>> regions selectively. Perhaps a hint in the current keyword syntax
>> (say, explicitly applying the font-lock-multiline property), or a
>> separate method for providing multi-line keywords to font-lock.
>
> I don't understand the difference between the above and the application
> of font-lock-multiline properties which you seem to have tried and rejected.
>
> I don't necessarily disagree with your rejection of font-lock-multiline:
> it can have disastrous effect indeed if the multiline region becomes large.
>
>> Such keywords would get the multi-line extended regions, the other
>> just the whole-line extensions (or whatever the hooks do).
>> Is this something the font-lock maintainers would consider?
>
> I guess I simply do not understand what you propose. Any improvement in
> the multiline handling is welcome, but beware: this is not an easy area.
>
* bug#192: regexp does not work as documented
From: Stefan Monnier @ 2008-05-12 16:29 UTC
To: David Koppelman; +Cc: Chong Yidong, 192, Bruno Haible, emacs-devel

> I'm proposing that font-lock divide keywords into two or three
> classes, ordinary, multi-line, and maybe mega-line, matches for
> multi-line and mega-line keywords would be over much larger
> regions. Here is how it might work with two classes (keep in mind that
> I don't yet have a thorough understanding of font-lock and jit-lock):

I do not understand how you propose to solve the main problem:
Let's say you want to fontify a line spanning chars 100..200 and
a multiline region spanning 0..400. Before fontifying, you need
to unfontify. The region 100..200 can be completely unfontified, but
what about 0..99 and 201..400? You can't unfontify them completely
since you don't want to refontify them completely either, so you'd need
to figure out which part of the fontification comes from the
multiline keywords.

Also, the order between keywords is important, so unless you force all
multiline keywords to go at the very end, you'd also need to remove (on
the 0..99 and 201..400 regions) the fontification coming from small
keywords that were placed after multiline keywords and reapply
it afterwards?


        Stefan
* bug#192: regexp does not work as documented
From: David Koppelman @ 2008-05-12 17:04 UTC
To: Stefan Monnier; +Cc: Chong Yidong, 192, Bruno Haible, emacs-devel

> a multiline region spanning 0..400. Before fontifying, you need
> to unfontify. The region 100..200 can be completely unfontified, but

Hadn't thought about that.

I don't want things to get too elaborate but it would be nice to have
guaranteed behavior below some multi-line size and not risk slow
behavior.

One possibility is to retain the code as it is, except have
extend-region-multiline extend to some maximum size (say, 100 lines)
with the expectation that the larger region would be used for deferred
fontification (I guess jit-lock does that). The only difference from
current operation is that the font-lock-multiline property is ignored,
both ensuring proper matches (when the property is not present but a
pattern would match) and avoiding huge regions.

Now, if we wanted really large multi-line matches we could unfontify
the larger region but use a window+margin sized region (accounting for
all buffers visiting the file) for the regular patterns and then mark
the other parts of the larger region as unfontified. This would force
re-applying the multi-line patterns on buffer motion, though we could
cache the match data to avoid re-searching.

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> I'm proposing that font-lock divide keywords into two or three
>> classes, ordinary, multi-line, and maybe mega-line, matches for
>> multi-line and mega-line keywords would be over much larger
>> regions. Here is how it might work with two classes (keep in mind that
>> I don't yet have a thorough understanding of font-lock and jit-lock):
>
> I do not understand how you propose to solve the main problem:
> Let's say you want to fontify a line spanning chars 100..200 and
> a multiline region spanning 0..400. Before fontifying, you need
> to unfontify. The region 100..200 can be completely unfontified, but
> what about 0..99 and 201..400? You can't unfontify them completely
> since you don't want to refontify them completely either, so you'd need
> to figure out which part of the fontification comes from the
> multiline keywords.
>
> Also, the order between keywords is important, so unless you force all
> multiline keywords to go at the very end, you'd also need to remove (on
> the 0..99 and 201..400 regions) the fontification coming from small
> keywords that were placed after multiline keywords and reapply
> it afterwards?
>
>
>         Stefan
* bug#192: regexp does not work as documented
From: Stefan Monnier @ 2008-05-11 18:44 UTC
To: Chong Yidong; +Cc: David Koppelman, 192, Bruno Haible, emacs-devel

> As for making hi-lock-mode detect whether or not a regexp is multi-line,
> isn't that a computationally non-trivial problem?

Well, you can turn the regexp into a DFA, then take the ".*\n.+" regexp,
turn it into another DFA, take the intersection of the two DFAs, and if
it's empty you know your regexp can never match a multiline element.


        Stefan
* bug#192: regexp does not work as documented
From: Thomas Lord @ 2008-05-11 20:03 UTC
To: Stefan Monnier
Cc: Chong Yidong, 192, emacs-devel, David Koppelman, Bruno Haible

>> As for making hi-lock-mode detect whether or not a regexp is multi-line,
>> isn't that a computationally non-trivial problem?
>>
>
> Well, you can turn the regexp into a DFA, then take the ".*\n.+" regexp,
> turn it into another DFA, take the intersection of the two DFAs, and if
> it's empty you know your regexp can never match a multiline element.
>

If you are going to go to that trouble, perhaps there is a better
solution:

The Rx pattern matcher found in distributions of GNU Arch has these
relevant capabilities (relevant at least so far as I understand the
problem you are trying to solve):

1. It does on-the-fly regexp->DFA conversion, degrading gracefully
   into mixed NFA/DFA mode or pure NFA mode if the DFA would be too
   large. The calling program gets to say what "too large" is.

2. Although it is a C library, you can capture what is (in essence)
   the continuation of an on-going match. That is, you can suspend a
   match (or scan) part-way through, then later resume from that
   point, perhaps multiple times. (This does not involve abusing the
   C stack.)

3. It does have some Unicode support in there and, though these
   capabilities are under-tested and some features are missing, it is
   quite flexible about encoding forms.

4. The DFA construction is "caching" and, for a given regexp, all uses
   will share the DFA construction. E.g., multiple, suspended regexp
   continuations can be space efficient because they will share state.

5. Because of the caching and structure sharing, you can tell if two
   continuations from a single regexp have arrived at the same state
   with a C EQ test ("==").

How can this help?

Well, instead of using heuristics to decide where to re-scan from and
to, you can cache a record of where the DFA scan arrived at for
periodic positions in the buffer. Then begin scanning from just
before any modification for as far as it takes to arrive at a DFA
state that is the same as last time, updating any highlighting in the
region between those two points.

I don't mean to imply that this is a trivial thing to implement in
Emacs but if you start getting up to building DFAs (very expensive in
the worst case) and taking intersections (very expensive in the worst
case) -- both also not all that simple to implement (nor obviously
possible for Emacs' extended regexp language) -- then the effort may
be comparable and (re-)visiting the option to adapt Rx to Emacs should
be worth considering.

As a point of amusement and credit where due, I think it was Jim
Blandy who first noticed this possibility in the early 1990s when I
was explaining to him the capabilities I was then just beginning to
add to Rx. This is a very old problem, long recognized, with some
work already done on a (purportedly) Right Thing solution.

-t
* bug#192: regexp does not work as documented
From: Stefan Monnier @ 2008-05-12 1:43 UTC
To: Thomas Lord; +Cc: Chong Yidong, 192, emacs-devel, David Koppelman, Bruno Haible

> Well, instead of using heuristics to decide where to re-scan from and
> to, you can cache a record of where the DFA scan arrived at for
> periodic positions in the buffer. Then begin scanning from just
> before any modification for as far as it takes to arrive at a DFA
> state that is the same as last time, updating any highlighting in the
> region between those two points.

That's a very good point. I'm not sure it's worth the trouble to store
it at various buffer positions and check if it's EQ to stop the rescan,
but at least we could match multiline expressions one line at a time.

In any case, it's indeed a non-trivial amount of work because it
probably requires rewriting not just font-lock but all the
foo-mode-font-lock-keywords as well (font-lock-keywords are order
dependent so you can't apply the rule nb 3 after rule nb 4).

> I don't mean to imply that this is a trivial thing to implement in
> Emacs but if you start getting up to building DFAs (very expensive in
> the worst case) and taking intersections (very expensive in the worst
> case) -- both also not all that simple to implement (nor obviously
> possible for Emacs' extended regexp language) -- then the effort may
> be comparable and (re-)visiting the option to adapt Rx to Emacs should
> be worth considering.

I have most of the DFA construction code written, but I may take you up
on that anyway. BTW, regarding the "very expensive in the worst case",
how common is this worst case in real life?


        Stefan
* bug#192: regexp does not work as documented
From: Thomas Lord @ 2008-05-12 3:30 UTC
To: Stefan Monnier
Cc: Chong Yidong, 192, emacs-devel, David Koppelman, Bruno Haible

Stefan Monnier wrote:
> I have most of the DFA construction code written, but I may take you up
> on that anyway. BTW, regarding the "very expensive in the worst case",
> how common is this worst case in real life?
>

Talking about general purpose use of an engine, as opposed to something more narrow like "likely font-lock expressions":

It's common enough in "real life" (in my experience) to be exactly the minimal amount required to make it annoying.

Let me give you some rules of thumb to describe what I mean but I'll refrain from a long treatise explaining the theory that supports these. If you have questions I can "unpack" this on list or off.

First, be careful to make a clear distinction between (let's dub the distinction) "regexps vs. regular expressions".

"Regular expressions" correspond to formally regular languages. Regular expressions can always be converted to a DFA.

"regexps" are what most people use in most situations. regexps include things like features to extract the positions that match a sub-pattern. regexps do *not* correspond to regular languages and *cannot* always be converted to a DFA. DFA conversion can help with regexp matching, but it can't solve it completely. Moreover, you can make pathological regexps (very slow to match) for every regexp engine I know of. Henry Spencer (I've heard) asserts that regexps are NP-complete or something around there, though I haven't seen the proof.

(Rx is a hybrid regular expression and regexp engine based on an on-line (caching, incremental) DFA conversion suggested in the "Dragon" compiler book and originating from (as I recall) an experiment by Thompson and one of the other Bell Labs guys. One of whoever it was privately mentioned that they abandoned it themselves because it was taking too much code to implement, or something like that.)

That aside, regular expressions are probably plenty for font-lock functionality, so let's just talk about those. On-line DFA conversion for those is likely practical by multiple means.

The size of a DFA is, worst-case, exponential in the size of the regular expression. This is the source of pathological cases that can thwart any DFA engine. (In exponential cases, Rx keeps running but more like an NFA, taking a corresponding performance hit.)

For a time, I was pushing hard on Rx, trying to use it for absolutely everything, making up new problems to solve, as in "Ok, let's suppose I can run X-Kbyte long regular expressions with lots of | operators, stars, etc. What can I apply that to?" One example is lexing for real-world computer languages.

[Aside: you can also use the same engine as the core of a shift-reduce parser, of course -- and I did that a long time ago with Guile, to useful effect.]

Almost always, the expressions were such that Rx would have no problem and give very pleasing results. However, there were definitely times (for huge regular expressions and smaller ones) when things would just absolutely crawl. They happen "just often enough" to be an annoyance.

Now, every case of that annoyance that I found (in real-life applications) had a solution. I could think for a while on *why* the regular expression I wrote was blowing up and then think of a different approach that eliminated the problem. Most often that meant not a different but equivalent regular expression -- it meant going one level up and changing the way the caller used regular expressions.
The caller would retain the same functionality but different demands would be made on the regular expression (or regexp) engine.

And that makes it tricky to drop *blindly* into Emacs. To solve the unavoidable annoying cases you really had to know how regular expressions worked at a deep level. To diagnose and work around the pathological cases took some expertise.

As a general rule, for something "general purpose" like a libc implementation or the default regexp functions in Emacs lisp, there is something to be said for using NFA matchers. It's perverse why: Vast swaths of things that a DFA matcher can do very quickly an NFA matcher cannot. It's reliably slow. Therefore, people not prepared to think hard about regexps tend to use an NFA matcher in pretty limited ways. A powerful regular expression engine could simplify their code, at the cost of risking finding a "hard case" -- but there's no issue since people quickly give up and don't try to push the match engine that hard.

More concisely: If you use a DFA matcher You Better Know What You're Doing but also Most of the Time You Won't Need To So It's Very Convenient. If you use an NFA matcher You Can Get Away With Not Really Knowing What You're Doing but also You Won't Be Trying Anything Fancy, Perhaps Losing Out.

One approach to toss into the mix, this one suggested by Per Bothner years ago, is to consider *offline* DFA conversion (a la 'lex(1)').

The advantage of offline (batch) conversion is that you can burn a lot of cycles on DFA minimization and, if your offline converter terminates, you've got a reliably linear matcher. The disadvantages for *many* uses of regular expressions in Emacs should be pretty obvious. For something like font-lock, where the regular expressions don't change that often, that might be a good approach -- precompile a minimal DFA and then add support for "regular expression continuations" when using those tables.

-t
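To make the "exponential in the worst case" remark concrete, here is a small Python sketch (not Rx code; all names are hypothetical). It uses the textbook language "the n-th character from the end is an a", i.e. (a|b)*a(a|b)^(n-1), whose NFA has n+1 states. The first function counts the DFA states an offline subset construction would have to build; the second builds DFA states lazily and caches them, roughly in the spirit of the on-line conversion described above.

```python
def nfa_step(states, ch, n):
    """Advance a set of NFA states over one character ('a' or 'b').

    NFA for (a|b)*a(a|b)^(n-1): state 0 loops on every character and also
    moves to state 1 on 'a'; state i moves to i + 1; state n is accepting.
    """
    nxt = set()
    for s in states:
        if s == 0:
            nxt.add(0)
            if ch == "a":
                nxt.add(1)
        elif s < n:
            nxt.add(s + 1)
    return frozenset(nxt)


def count_dfa_states(n):
    """Offline subset construction: count every reachable DFA state."""
    start = frozenset({0})
    seen = {start}
    todo = [start]
    while todo:
        cur = todo.pop()
        for ch in "ab":
            nxt = nfa_step(cur, ch, n)
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return len(seen)


def lazy_match(text, n, cache):
    """On-line conversion: build and cache only the DFA states we visit."""
    state = frozenset({0})
    for ch in text:
        key = (state, ch)
        if key not in cache:
            cache[key] = nfa_step(state, ch, n)
        state = cache[key]
    return n in state
```

For this family the offline count doubles with every increment of n, while the lazy matcher never creates more cache entries than the number of characters it has read. That is one way to see why an on-line engine can keep running on expressions whose full DFA would be impractically large.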
* bug#192: regexp does not work as documented
From: Stefan Monnier @ 2008-05-12 13:43 UTC
To: Thomas Lord; +Cc: Chong Yidong, 192, emacs-devel, David Koppelman, Bruno Haible

> years ago, is to consider *offline* DFA conversion (a la 'lex(1)').

That's what I do in lex.el.

> The advantage of offline (batch) conversion is that you can burn a lot
> of cycles on DFA minimization and, if your offline converter
> terminates, you've got a reliably linear matcher. The disadvantages
> for *many* uses of regular expressions in Emacs should be pretty
> obvious. For something like font-lock, where the regular expressions
> don't change that often, that might be a good approach -- precompile
> a minimal DFA and then add support for "regular expression
> continuations" when using those tables.

I do not intend to replace src/regexp.c with a matcher based on offline DFA conversion. Actually, the need to support backrefs makes it pretty much impossible (tho I'm sure there's a way to adapt an offline DFA so it can be used with backrefs), and most importantly its performance characteristics are too different. More specifically, the compilation step should be made explicit.

In any case I think you did answer my question: an offline DFA matcher is fine, the worst case is not that common and can be worked around. This is not that different from the current backtracking matcher.

        Stefan

PS: The original motivation for a DFA-matcher is to extend syntax-tables so they can match multi-char elements.
* bug#192: regexp does not work as documented
From: Thomas Lord @ 2008-05-12 15:55 UTC
To: Stefan Monnier
Cc: Chong Yidong, 192, emacs-devel, David Koppelman, Bruno Haible

Stefan Monnier wrote:
> That's what I do in lex.el.
>

Sounds nice. Last bits of experience report, then:

If it isn't so already, it may be easy to make it so that a choice of which DFA is being used, plus a choice of the "current state" can be represented as lisp objects and cheaply copied. That gives the essence of "regular expression continuations".

Handy features that shouldn't be difficult to add (if not present):

Let programmers specify "labels" for each NFA state and then, for each DFA state, have either a list of all NFA labels that correspond to that DFA state and/or a more general way to "combine" NFA state labels to make the DFA label. You can wind up with many NFA states combined to a single DFA state, of course, so a "combine" function might be important.

Include scanning functions to:

  ~ advance the DFA at most N characters (or until failure)

  ~ advance the DFA to the next non-nil state label (or failure)

In both cases, give a way for lisp programs to get back not only the label (or failure indication) but also the regular expression continuation.

Those features are handy so that (for example) lisp programs can hang a suspended regexp continuation on a buffer character as a property, doing incremental "re-lexing" in application-specific ways. The "advance to non-nil label" feature is useful for writing lisp programs that *do not* need back-referencing or sub-exp locations per se.

It is a bit more speculative but also consider functions to:

  ~ advance the state of a DFA based on characters provided in a function call rather than read from a buffer -- e.g., a buffer position should not have to be part of the state of a running DFA.

      (advance-dfa re-continuation chr) => re-continuation

Why that last one? Because then you can probably use the same DFA engine as the heart of a shift-reduce parser and (for languages that admit such things) write an incremental parser. (You'd be using non-buffer-position DFAs to process token ids emitted by the lexer.) You can also use such a feature for things like serial I/O protocols.

Incremental parsers open the door to robust "syntax directed editing" which I think could be an exciting direction for IDE features to take.

(Years ago, Thomas Reps and Tim Teitelbaum worked on the "Synthesizer Generator" which I recall had features along these lines (their parser guts were probably different from what I suggest). As I (now vaguely) recall there is a book that talks about their Emacs-based implementation.)

Bye. Thanks. And good luck!

-t
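The interface suggested above might look roughly like the following. This is a sketch in Python rather than Emacs Lisp, and every name in it (`DFA`, `Continuation`, `advance_dfa`, `advance_at_most`, `advance_to_label`, the toy `STATEMENT_DFA`) is an assumption made for illustration; none of it is the API of lex.el or Rx. A continuation is an immutable (DFA, state) pair, so copying or storing one is cheap, and no buffer position is part of it.

```python
from typing import NamedTuple, Optional


class DFA(NamedTuple):
    transitions: dict          # (state, char) -> next state
    labels: dict               # state -> label; a missing entry means "nil"


class Continuation(NamedTuple):
    dfa: DFA
    state: Optional[int]       # None means the match has failed


def advance_dfa(cont, ch):
    """Feed one character; return a new continuation (the old one is untouched)."""
    if cont.state is None:
        return cont
    return Continuation(cont.dfa, cont.dfa.transitions.get((cont.state, ch)))


def advance_at_most(cont, text, pos, n):
    """Advance at most n characters or until failure; return (cont, new_pos)."""
    end = min(len(text), pos + n)
    while pos < end and cont.state is not None:
        cont = advance_dfa(cont, text[pos])
        pos += 1
    return cont, pos


def advance_to_label(cont, text, pos):
    """Advance to the next labelled state, failure, or end of text.

    Returns (label, cont, new_pos); label is None on failure or end of text.
    """
    while pos < len(text) and cont.state is not None:
        cont = advance_dfa(cont, text[pos])
        pos += 1
        label = cont.dfa.labels.get(cont.state)
        if label is not None:
            return label, cont, pos
    return None, cont, pos


# Toy DFA for [a-z]+; with a label on the state reached after the semicolon.
_transitions = {}
for _c in "abcdefghijklmnopqrstuvwxyz":
    _transitions[(0, _c)] = 1
    _transitions[(1, _c)] = 1
_transitions[(1, ";")] = 2
STATEMENT_DFA = DFA(_transitions, {2: "statement-end"})
```

Because `advance_dfa` takes the character as an argument, the same engine could be driven by buffer text, by token ids from a lexer, or by bytes arriving from a process, and a continuation returned by `advance_at_most` can be stored and resumed later with `advance_to_label`.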
* bug#192: regexp does not work as documented
From: tomas @ 2008-05-12 16:18 UTC
To: Thomas Lord; +Cc: Chong Yidong, 192, emacs-devel, David Koppelman, Bruno Haible

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Mon, May 12, 2008 at 08:55:14AM -0700, Thomas Lord wrote:

[...]

> Why that last one? Because then you can probably use the same
> DFA engine as the heart of a shift-reduce parser and (for languages
> that admit such things) write an incremental parser. (You'd be using
> non-buffer-position DFAs to process token ids emitted by the lexer.)
> You can also use such a feature for things like serial I/O protocols.

...and cool things could be done with process-filter-function and its cousin after-insert-file-functions (i.e. parse input on-the-fly).

Nifty stuff.

Regards
- -- 
tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)

iD8DBQFIKG3jBcgs9XrR2kYRAtv8AJ9uj1wEjjT4bIPNQxYoYY5iPJW8cwCdE87U
vsVarzdJhCu143kN7OGWh/Q=
=0dcf
-----END PGP SIGNATURE-----
* bug#192: regexp does not work as documented
From: Bruno Haible @ 2008-05-10 20:04 UTC
To: Stefan Monnier; +Cc: Chong Yidong, 192, koppel, emacs-devel

Stefan Monnier wrote:
> > - I originally observed the bug in po-mode (part of GNU gettext), in
> >   a function po-find-span-of-entry which essentially only calls
> >   re-search-backward and re-search-forward.
>
> Please try and reproduce the problem there and send us a recipe.

After deeper investigation, the bug in po-mode was not directly related: I had used a regexp which was not designed for use with re-search-backward. This bug is closed now. <https://savannah.gnu.org/bugs/?23177>

Thank you Stefan for putting me on the right track.

Bruno
* regexp does not work as documented
From: Bruno Haible @ 2008-05-06 1:30 UTC
To: bug-gnu-emacs

The regular expression (as an Emacs string)

  "^msgstr\\(\\[[0-9]\\]\\)?.*\n\\(\".*\n\\)*"

is supposed to search for a line starting with "msgstr" and an optional digit inside brackets, followed by as many lines starting with a double-quote as possible. The Emacs documentation says:

  "The matcher processes a `*' construct by matching, immediately, as many repetitions as can be found."

This is apparently not the case with the above regexp.

----------------------------------------------------------------------------
To reproduce:

$ emacs fr.po
or
$ emacs -nw fr.po

Leave the cursor at the beginning of the file.

M-x highlight-regexp
Regexp to highlight (enter this with Ctrl-Q Ctrl-J for each of the two newlines):
^msgstr\(\[[0-9]\]\)?.*
\(".*
\)*
Highlight using face: hi-yellow

Then move around in the buffer and look which lines are highlighted.

In the first match already, only 5 out of 11 lines are highlighted. Then, line 300 (in emacs 22) or line 335 (in emacs 21) is not highlighted either.

The attached screenshots were produced with emacs-22.2.1 (on i686-pc-linux-gnu) and with emacs-21.2.1 (on powerpc-apple-darwin7.9.0).
----------------------------------------------------------------------------

In GNU Emacs 22.2.1 (i686-pc-linux-gnu, X toolkit, Xaw3d scroll bars)
 of 2008-05-01 on linuix
Windowing system distributor `The XFree86 Project, Inc', version 11.0.40300001
configured using `configure '--prefix=/packages/gnu''

Important settings:
  value of $LC_ALL: nil
  value of $LC_COLLATE: POSIX
  value of $LC_CTYPE: nil
  value of $LC_MESSAGES: nil
  value of $LC_MONETARY: nil
  value of $LC_NUMERIC: nil
  value of $LC_TIME: nil
  value of $LANG: de_DE.UTF-8
  locale-coding-system: utf-8
  default-enable-multibyte-characters: t

Major mode: Lisp Interaction

Minor modes in effect:
  tooltip-mode: t
  tool-bar-mode: t
  mouse-wheel-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  blink-cursor-mode: t
  unify-8859-on-encoding-mode: t
  utf-translate-cjk-mode: t
  auto-compression-mode: t
  line-number-mode: t

Recent input:
<help-echo> <help-echo> <help-echo> <help-echo> <help-echo> <help-echo> <help-echo> <help-echo> <help-echo> <help-echo> <help-echo> <help-echo> <menu-bar> <help-menu> <send-emacs-bug-report>

Recent messages:
("emacs")
Loading cl-indent...done
Loading derived...done
For information about GNU Emacs and the GNU system, type C-h C-a.
Loading emacsbug...
Loading regexp-opt...done
Loading emacsbug...done

PS: Where is the bug tracker? I don't see it at https://savannah.gnu.org/projects/emacs

[-- Attachment #2: emacs21-1.png --]
[-- Type: image/png, Size: 18408 bytes --]

[-- Attachment #3: emacs22-2.png --]
[-- Type: image/png, Size: 19671 bytes --]

[-- Attachment #4: emacs21-2.png --]
[-- Type: image/png, Size: 20154 bytes --]

[-- Attachment #5: emacs22-1.png --]
[-- Type: image/png, Size: 18243 bytes --]

[-- Attachment #6: fr.po.gz --]
[-- Type: application/x-gzip, Size: 44518 bytes --]
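For readers who want to see what the regexp in this report is meant to match, here is a small check in Python. It is only an illustration of the documented greedy `*` behaviour using Python's `re` module, which is a different engine from Emacs's; the pattern is a direct transliteration of the reported one, the sample text is shortened from the PO header quoted earlier in the thread, and the name `highlighted_lines` is hypothetical. It says nothing about how font-lock decides which region to refontify.

```python
import re

# Transliteration of the reported Emacs regexp into Python syntax.
PATTERN = re.compile(r'^msgstr(\[[0-9]\])?.*\n(".*\n)*', re.MULTILINE)

# Shortened sample in the style of a PO file header; "\\n" is a literal
# backslash followed by n inside the quoted PO string.
SAMPLE = (
    'msgid ""\n'
    'msgstr ""\n'
    '"Project-Id-Version: GNU gettext-tools 0.16.2-pre5\\n"\n'
    '"Report-Msgid-Bugs-To: bug-gnu-gettext@gnu.org\\n"\n'
    '"POT-Creation-Date: 2007-11-02 03:23+0100\\n"\n'
    'x\n'
)


def highlighted_lines(text):
    """Return the lines covered by the first match, or [] if none."""
    m = PATTERN.search(text)
    return m.group(0).splitlines() if m else []
```

On this sample the match covers the `msgstr` line and all three quoted continuation lines and stops before the `x` line, which is the span the report expected to see highlighted.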
* bug#192: regexp does not work as documented
From: Bruno Haible @ 2015-12-29 17:48 UTC
To: 192

The bug was reproducible with emacs 22.2.1. I have now verified that it is fixed in emacs 23.1.

Thanks!