* bug#19873: Ill-formed regular expression is constructed in forward-paragraph. @ 2015-02-15 10:31 Alan Mackenzie 2017-02-26 16:44 ` Marcin Borkowski 0 siblings, 1 reply; 11+ messages in thread From: Alan Mackenzie @ 2015-02-15 10:31 UTC (permalink / raw) To: 19873 Hello, Emacs! In forward-paragraph, L37, a regular expression is constructed as follows: (let* ... (sp-parstart (concat "^[ \t]*\\(?:" parstart "\\|" parsep "\\)")) ...) . Here parstart and parsep are, more or less, paragraph-{start,separate}. The problem is that parstart and parsep themselves are likely to begin with "[ \t]*" (the default values certainly do), so we have two consecutive matchers for an arbitrary amount of whitespace. This causes the regexp engine to run very slowly when a line starts with lots of WS but doesn't match. This problem seems to be the cause of bug # 19846 (where holding down the spacebar inside a C comment causes Emacs to seize up when auto-fill mode is enabled). -- Alan Mackenzie (Nuremberg, Germany). ^ permalink raw reply [flat|nested] 11+ messages in thread
* bug#19873: Ill-formed regular expression is constructed in forward-paragraph. 2015-02-15 10:31 bug#19873: Ill-formed regular expression is constructed in forward-paragraph Alan Mackenzie @ 2017-02-26 16:44 ` Marcin Borkowski 2017-02-26 16:57 ` Eli Zaretskii ` (2 more replies) 0 siblings, 3 replies; 11+ messages in thread From: Marcin Borkowski @ 2017-02-26 16:44 UTC (permalink / raw) To: Alan Mackenzie; +Cc: 19873 On 2015-02-15, at 10:31, Alan Mackenzie <acm@muc.de> wrote: > Hello, Emacs! > > In forward-paragraph, L37, a regular expression is constructed as > follows: > > (let* ... > (sp-parstart (concat "^[ \t]*\\(?:" parstart "\\|" parsep "\\)")) > ...) > > . Here parstart and parsep are, more or less, > paragraph-{start,separate}. > > The problem is that parstart and parsep themselves are likely to begin > with "[ \t]*" (the default values certainly do), so we have two > consecutive matchers for an arbitrary amount of whitespace. This causes > the regexp engine to run very slowly when a line starts with lots of WS > but doesn't match. > > This problem seems to be the cause of bug # 19846 (where holding down the > spacebar inside a C comment causes Emacs to seize up when auto-fill mode > is enabled). Hi Alan, hi all, I put this bug on my todo-list some time ago and decided now to revisit it. I'm wondering what could be done about it. First of all, my Emacs has this as paragraph-start: "\f\\|[ ]*$" and this as paragraph-separate: "[ \f]*$" and frankly speaking, I'm not sure why they differ at all (by default). Also, even though forward-paragraph checks for "^" at their beginning, they actually don't begin with that character (again, by default). My first thought is to add a check whether paragraph-start and paragraph-sep match something like "^\\^?\\[[[:space:]]+\\][+*]?" and if yes, make parstart/parsep equal to them, but without the matching part. WDYT? -- Marcin Borkowski http://octd.wmi.amu.edu.pl/en/Marcin_Borkowski Faculty of Mathematics and Computer Science Adam Mickiewicz University ^ permalink raw reply [flat|nested] 11+ messages in thread
* bug#19873: Ill-formed regular expression is constructed in forward-paragraph. 2017-02-26 16:44 ` Marcin Borkowski @ 2017-02-26 16:57 ` Eli Zaretskii 2017-02-26 18:48 ` Marcin Borkowski 2017-03-07 16:47 ` Eli Zaretskii 2017-03-09 21:04 ` Alan Mackenzie 2 siblings, 1 reply; 11+ messages in thread From: Eli Zaretskii @ 2017-02-26 16:57 UTC (permalink / raw) To: Marcin Borkowski; +Cc: acm, 19873 > From: Marcin Borkowski <mbork@amu.edu.pl> > Date: Sun, 26 Feb 2017 17:44:51 +0100 > Cc: 19873@debbugs.gnu.org > > First of all, my Emacs has this as paragraph-start: > > "\f\\|[ ]*$" > > and this as paragraph-separate: > > "[ \f]*$" > > and frankly speaking, I'm not sure why they differ at all (by default). I believe this is explained in the Emacs manual, node "Paragraphs". In a nutshell, these two regexps have different purposes. ^ permalink raw reply [flat|nested] 11+ messages in thread
* bug#19873: Ill-formed regular expression is constructed in forward-paragraph. 2017-02-26 16:57 ` Eli Zaretskii @ 2017-02-26 18:48 ` Marcin Borkowski 0 siblings, 0 replies; 11+ messages in thread From: Marcin Borkowski @ 2017-02-26 18:48 UTC (permalink / raw) To: Eli Zaretskii; +Cc: acm, 19873 On 2017-02-26, at 17:57, Eli Zaretskii <eliz@gnu.org> wrote: >> From: Marcin Borkowski <mbork@amu.edu.pl> >> Date: Sun, 26 Feb 2017 17:44:51 +0100 >> Cc: 19873@debbugs.gnu.org >> >> First of all, my Emacs has this as paragraph-start: >> >> "\f\\|[ ]*$" >> >> and this as paragraph-separate: >> >> "[ \f]*$" >> >> and frankly speaking, I'm not sure why they differ at all (by default). > > I believe this is explained in the Emacs manual, node "Paragraphs". > In a nutshell, these two regexps have different purposes. My bad, you're right. Still, this has little to do with the bug itself. Best, -- Marcin Borkowski http://octd.wmi.amu.edu.pl/en/Marcin_Borkowski Faculty of Mathematics and Computer Science Adam Mickiewicz University ^ permalink raw reply [flat|nested] 11+ messages in thread
* bug#19873: Ill-formed regular expression is constructed in forward-paragraph. 2017-02-26 16:44 ` Marcin Borkowski 2017-02-26 16:57 ` Eli Zaretskii @ 2017-03-07 16:47 ` Eli Zaretskii 2017-03-09 21:04 ` Alan Mackenzie 2 siblings, 0 replies; 11+ messages in thread From: Eli Zaretskii @ 2017-03-07 16:47 UTC (permalink / raw) To: Marcin Borkowski; +Cc: acm, 19873 > From: Marcin Borkowski <mbork@amu.edu.pl> > Date: Sun, 26 Feb 2017 17:44:51 +0100 > Cc: 19873@debbugs.gnu.org > > My first thought is to add a check whether paragraph-start and > paragraph-sep match something like > > "^\\^?\\[[[:space:]]+\\][+*]?" > > and if yes, make parstart/parsep equal to them, but without the matching > part. > > WDYT? Can you suggest a patch along these lines? Alan, any comments? Thanks. ^ permalink raw reply [flat|nested] 11+ messages in thread
* bug#19873: Ill-formed regular expression is constructed in forward-paragraph. 2017-02-26 16:44 ` Marcin Borkowski 2017-02-26 16:57 ` Eli Zaretskii 2017-03-07 16:47 ` Eli Zaretskii @ 2017-03-09 21:04 ` Alan Mackenzie 2021-12-02 10:39 ` Lars Ingebrigtsen 2 siblings, 1 reply; 11+ messages in thread From: Alan Mackenzie @ 2017-03-09 21:04 UTC (permalink / raw) To: Marcin Borkowski; +Cc: 19873 Hello, Marcin. On Sun, Feb 26, 2017 at 17:44:51 +0100, Marcin Borkowski wrote: > On 2015-02-15, at 10:31, Alan Mackenzie <acm@muc.de> wrote: > > Hello, Emacs! > > In forward-paragraph, L37, a regular expression is constructed as > > follows: > > (let* ... > > (sp-parstart (concat "^[ \t]*\\(?:" parstart "\\|" parsep "\\)")) > > ...) > > . Here parstart and parsep are, more or less, > > paragraph-{start,separate}. > > The problem is that parstart and parsep themselves are likely to begin > > with "[ \t]*" (the default values certainly do), so we have two > > consecutive matchers for an arbitrary amount of whitespace. This causes > > the regexp engine to run very slowly when a line starts with lots of WS > > but doesn't match. > > This problem seems to be the cause of bug # 19846 (where holding down the > > spacebar inside a C comment causes Emacs to seize up when auto-fill mode > > is enabled). > Hi Alan, hi all, > I put this bug on my todo-list some time ago and decided now to revisit > it. > I'm wondering what could be done about it. First of all, my Emacs has > this as paragraph-start: > "\f\\|[ ]*$" > and this as paragraph-separate: > "[ \f]*$" > and frankly speaking, I'm not sure why they differ at all (by default). > Also, even though forward-paragraph checks for "^" at their beginning, > they actually don't begin with that character (again, by default). > My first thought is to add a check whether paragraph-start and > paragraph-sep match something like > "^\\^?\\[[[:space:]]+\\][+*]?" > and if yes, make parstart/parsep equal to them, but without the matching > part. > WDYT? My first reaction is "This is a good idea, but be very careful!". For example, if paragraph-start and/or paragraph-separate begin with "[ \t]+" (i.e. the paragraph start requires space at BOL), you will miss it by removing matches of "^\\^?\\[[[:space:]]+\\][+*]?" from them. I think this idea is workable, but you'll have to check for one or both of paragraph-s{tart,eparate} starting with "[ \t]+". A good strategy here might be to begin the target regexp with "^[ \t]*", then begin one or both components with "[ \t]" (without the "*"). There may be other gotchas which I haven't thought about yet. One needs a twisted mind to do this sort of thing properly, so I offer my services to review your upcoming patch. ;-) > -- > Marcin Borkowski > http://octd.wmi.amu.edu.pl/en/Marcin_Borkowski > Faculty of Mathematics and Computer Science > Adam Mickiewicz University -- Alan Mackenzie (Nuremberg, Germany). ^ permalink raw reply [flat|nested] 11+ messages in thread
* bug#19873: Ill-formed regular expression is constructed in forward-paragraph. 2017-03-09 21:04 ` Alan Mackenzie @ 2021-12-02 10:39 ` Lars Ingebrigtsen 2021-12-02 10:44 ` Lars Ingebrigtsen 2021-12-02 20:45 ` Alan Mackenzie 0 siblings, 2 replies; 11+ messages in thread From: Lars Ingebrigtsen @ 2021-12-02 10:39 UTC (permalink / raw) To: Alan Mackenzie; +Cc: Marcin Borkowski, 19873 Alan Mackenzie <acm@muc.de> writes: > I think this idea is workable, but you'll have to check for one or both > of paragraph-s{tart,eparate} starting with "[ \t]+". A good strategy > here might be to begin the target regexp with "^[ \t]*", then begin one > or both components with "[ \t]" (without the "*"). > > There may be other gotchas which I haven't thought about yet. > > One needs a twisted mind to do this sort of thing properly, so I offer my > services to review your upcoming patch. ;-) The problem seems rather intractable to me. Is there really any way to examine a regexp to determine "does this in practice match [ \t]*"? I wonder whether instead of trying to construct a better overall regexp could rewrite the loop. That is, instead of searching for sp-parstart, search for parstart "\\|" parsep, and then check whether (match-beginning 0) of that comes after "^[ \t]*". Or something along those lines. But I don't know whether that'd be any faster in practice. Do you have a test case that demonstrates the slowness? In that case I could try to see whether there's any alternate approach here that's faster. -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 11+ messages in thread
* bug#19873: Ill-formed regular expression is constructed in forward-paragraph. 2021-12-02 10:39 ` Lars Ingebrigtsen @ 2021-12-02 10:44 ` Lars Ingebrigtsen 2021-12-02 11:17 ` Lars Ingebrigtsen 2021-12-02 20:45 ` Alan Mackenzie 1 sibling, 1 reply; 11+ messages in thread From: Lars Ingebrigtsen @ 2021-12-02 10:44 UTC (permalink / raw) To: Alan Mackenzie; +Cc: Marcin Borkowski, 19873 Lars Ingebrigtsen <larsi@gnus.org> writes: > Do you have a test case that demonstrates the slowness? In that case I > could try to see whether there's any alternate approach here that's > faster. Ah, there was a reproducer in bug#19846 (and it still reproduces). I'll poke around here, then. -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 11+ messages in thread
* bug#19873: Ill-formed regular expression is constructed in forward-paragraph. 2021-12-02 10:44 ` Lars Ingebrigtsen @ 2021-12-02 11:17 ` Lars Ingebrigtsen 0 siblings, 0 replies; 11+ messages in thread From: Lars Ingebrigtsen @ 2021-12-02 11:17 UTC (permalink / raw) To: Alan Mackenzie; +Cc: Marcin Borkowski, 19873 Lars Ingebrigtsen <larsi@gnus.org> writes: > Ah, there was a reproducer in bug#19846 (and it still reproduces). > > I'll poke around here, then. The following is about 100x faster in the test case, and doesn't seem to lead to any obvious regressions. But it's kinda ugly. Any comments? An alternate approach would be to just go line by line, and see whether the regexp matches at the start of the line (either before or after skipping past any leading whitespace). But I'm not sure that's any prettier. diff --git a/lisp/textmodes/paragraphs.el b/lisp/textmodes/paragraphs.el index 59b15e82a8..e0d633d49b 100644 --- a/lisp/textmodes/paragraphs.el +++ b/lisp/textmodes/paragraphs.el @@ -238,7 +238,7 @@ forward-paragraph fill-prefix-regexp "[ \t]*$") parsep)) ;; This is used for searching. - (sp-parstart (concat "^[ \t]*\\(?:" parstart "\\|" parsep "\\)")) + (sp-parstart (concat "\\(?:" parstart "\\|" parsep "\\)")) start found-start) (while (and (< arg 0) (not (bobp))) (if (and (not (looking-at parsep)) @@ -278,7 +278,7 @@ forward-paragraph ;; multiple-lines ;; (forward-line 1)) (not (bobp))) - (while (and (re-search-backward sp-parstart nil 1) + (while (and (paragraph--line-search nil sp-parstart) (setq found-start t) ;; Found a candidate, but need to check if it is a ;; REAL parstart. @@ -328,7 +328,7 @@ forward-paragraph (not (looking-at parsep)) (looking-at fill-prefix-regexp)) (forward-line 1)) - (while (and (re-search-forward sp-parstart nil 1) + (while (and (paragraph--line-search t sp-parstart) (progn (setq start (match-beginning 0)) (goto-char start) (not (eobp))) @@ -344,6 +344,34 @@ forward-paragraph ;; Return the number of steps that could not be done. arg)) +;; We do it this way because the regexp commonly starts with optional +;; whitespace. +(defun paragraph--line-search (forward regexp) + "Look for REGEXP starting on a line. +If FORWARD, search forward. If not, go backward." + (catch 'found + (while (and (or (and forward + (not (eobp))) + (and (not forward) + (not (bobp)))) + (funcall (if forward + #'re-search-forward + #'re-search-backward) + regexp nil 1)) + (save-excursion + (goto-char (match-beginning 0)) + (beginning-of-line) + (save-restriction + (narrow-to-region + (point) (match-beginning 0)) + (when (looking-at-p "[ \t]*$") + (throw 'found t)))) + (if forward + (unless (eobp) + (forward-char 1)) + (unless (bobp) + (backward-char 1)))))) + (defun backward-paragraph (&optional arg) "Move backward to start of paragraph. With argument ARG, do it ARG times; -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply related [flat|nested] 11+ messages in thread
* bug#19873: Ill-formed regular expression is constructed in forward-paragraph. 2021-12-02 10:39 ` Lars Ingebrigtsen 2021-12-02 10:44 ` Lars Ingebrigtsen @ 2021-12-02 20:45 ` Alan Mackenzie 2021-12-03 16:15 ` Lars Ingebrigtsen 1 sibling, 1 reply; 11+ messages in thread From: Alan Mackenzie @ 2021-12-02 20:45 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: Marcin Borkowski, 19873 Hello, Lars. On Thu, Dec 02, 2021 at 11:39:51 +0100, Lars Ingebrigtsen wrote: > Alan Mackenzie <acm@muc.de> writes: > > I think this idea is workable, but you'll have to check for one or both > > of paragraph-s{tart,eparate} starting with "[ \t]+". A good strategy > > here might be to begin the target regexp with "^[ \t]*", then begin one > > or both components with "[ \t]" (without the "*"). > > There may be other gotchas which I haven't thought about yet. > > One needs a twisted mind to do this sort of thing properly, so I offer my > > services to review your upcoming patch. ;-) > The problem seems rather intractable to me. Is there really any way to > examine a regexp to determine "does this in practice match [ \t]*"? Back when the bug was new, I started writing a library to analyse a regular expression and convert it into an equivalent well formed regular expression. It's actually working, but is incomplete. It's currently 2757 lines long, including pretty complete unit testing. I actually looked at it again at the start of November. > I wonder whether instead of trying to construct a better overall regexp > could rewrite the loop. That is, instead of searching for sp-parstart, > search for parstart "\\|" parsep, and then check whether > (match-beginning 0) of that comes after "^[ \t]*". Or something along > those lines. > But I don't know whether that'd be any faster in practice. It strikes me as one of these things which needs to be done systematically, which, as I said, I've already tried (and not yet given up). The question presents itself, would the effort be better spent improving Emacs's regexp engine? > Do you have a test case that demonstrates the slowness? In that case I > could try to see whether there's any alternate approach here that's > faster. Martin Rudalics had the original testcase. The slowness was exponential with the number of spaces typed, I think. > -- > (domestic pets only, the antidote for overdose, milk.) > bloggy blog: http://lars.ingebrigtsen.no -- Alan Mackenzie (Nuremberg, Germany). ^ permalink raw reply [flat|nested] 11+ messages in thread
* bug#19873: Ill-formed regular expression is constructed in forward-paragraph. 2021-12-02 20:45 ` Alan Mackenzie @ 2021-12-03 16:15 ` Lars Ingebrigtsen 0 siblings, 0 replies; 11+ messages in thread From: Lars Ingebrigtsen @ 2021-12-03 16:15 UTC (permalink / raw) To: Alan Mackenzie; +Cc: Marcin Borkowski, 19873 Alan Mackenzie <acm@muc.de> writes: > Back when the bug was new, I started writing a library to analyse a > regular expression and convert it into an equivalent well formed regular > expression. It's actually working, but is incomplete. It's currently > 2757 lines long, including pretty complete unit testing. I actually > looked at it again at the start of November. Interesting. > It strikes me as one of these things which needs to be done > systematically, which, as I said, I've already tried (and not yet given > up). The question presents itself, would the effort be better spent > improving Emacs's regexp engine? Indeed -- making the Emacs regexp engine transform these complex regexps into simpler, equivalent forms would be great. We wouldn't have to use it on all regexps -- the problem usually rears its head when we're combining several user-defined regexps into a large one, so if we had a `simplify-regexp' function that we could stick in here, that'd solve the issue here (and in similar circumstances elsewhere). -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2021-12-03 16:15 UTC | newest] Thread overview: 11+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2015-02-15 10:31 bug#19873: Ill-formed regular expression is constructed in forward-paragraph Alan Mackenzie 2017-02-26 16:44 ` Marcin Borkowski 2017-02-26 16:57 ` Eli Zaretskii 2017-02-26 18:48 ` Marcin Borkowski 2017-03-07 16:47 ` Eli Zaretskii 2017-03-09 21:04 ` Alan Mackenzie 2021-12-02 10:39 ` Lars Ingebrigtsen 2021-12-02 10:44 ` Lars Ingebrigtsen 2021-12-02 11:17 ` Lars Ingebrigtsen 2021-12-02 20:45 ` Alan Mackenzie 2021-12-03 16:15 ` Lars Ingebrigtsen
Code repositories for project(s) associated with this external index https://git.savannah.gnu.org/cgit/emacs.git https://git.savannah.gnu.org/cgit/emacs/org-mode.git This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.