bug#19873: Ill-formed regular expression is constructed in forward-paragraph.

unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed

* bug#19873: Ill-formed regular expression is constructed in forward-paragraph.
@ 2015-02-15 10:31 Alan Mackenzie
  2017-02-26 16:44 ` Marcin Borkowski
  0 siblings, 1 reply; 11+ messages in thread
From: Alan Mackenzie @ 2015-02-15 10:31 UTC (permalink / raw)
  To: 19873

Hello, Emacs!

In forward-paragraph, L37, a regular expression is constructed as
follows:

(let* ...
 (sp-parstart (concat "^[ \t]*\\(?:" parstart "\\|" parsep "\\)"))
 ...)

.  Here parstart and parsep are, more or less,
paragraph-{start,separate}.

The problem is that parstart and parsep themselves are likely to begin
with "[ \t]*" (the default values certainly do), so we have two
consecutive matchers for an arbitrary amount of whitespace.  This causes
the regexp engine to run very slowly when a line starts with lots of WS
but doesn't match.

This problem seems to be the cause of bug # 19846 (where holding down the
spacebar inside a C comment causes Emacs to seize up when auto-fill mode
is enabled).

-- 
Alan Mackenzie (Nuremberg, Germany).

^ permalink raw reply	[flat|nested] 11+ messages in thread

* bug#19873: Ill-formed regular expression is constructed in forward-paragraph.
  2015-02-15 10:31 bug#19873: Ill-formed regular expression is constructed in forward-paragraph Alan Mackenzie
@ 2017-02-26 16:44 ` Marcin Borkowski
  2017-02-26 16:57   ` Eli Zaretskii
                     ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Marcin Borkowski @ 2017-02-26 16:44 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: 19873

On 2015-02-15, at 10:31, Alan Mackenzie <acm@muc.de> wrote:

> Hello, Emacs!
>
> In forward-paragraph, L37, a regular expression is constructed as
> follows:
>
> (let* ...
>  (sp-parstart (concat "^[ \t]*\\(?:" parstart "\\|" parsep "\\)"))
>  ...)
>
> .  Here parstart and parsep are, more or less,
> paragraph-{start,separate}.
>
> The problem is that parstart and parsep themselves are likely to begin
> with "[ \t]*" (the default values certainly do), so we have two
> consecutive matchers for an arbitrary amount of whitespace.  This causes
> the regexp engine to run very slowly when a line starts with lots of WS
> but doesn't match.
>
> This problem seems to be the cause of bug # 19846 (where holding down the
> spacebar inside a C comment causes Emacs to seize up when auto-fill mode
> is enabled).

Hi Alan, hi all,

I put this bug on my todo-list some time ago and decided now to revisit
it.

I'm wondering what could be done about it.  First of all, my Emacs has
this as paragraph-start:

"\f\\|[ 	]*$"

and this as paragraph-separate:

"[ 	\f]*$"

and frankly speaking, I'm not sure why they differ at all (by default).
Also, even though forward-paragraph checks for "^" at their beginning,
they actually don't begin with that character (again, by default).

My first thought is to add a check whether paragraph-start and
paragraph-sep match something like

"^\\^?\\[[[:space:]]+\\][+*]?"

and if yes, make parstart/parsep equal to them, but without the matching
part.

WDYT?

-- 
Marcin Borkowski
http://octd.wmi.amu.edu.pl/en/Marcin_Borkowski
Faculty of Mathematics and Computer Science
Adam Mickiewicz University





^ permalink raw reply	[flat|nested] 11+ messages in thread

* bug#19873: Ill-formed regular expression is constructed in forward-paragraph.
  2017-02-26 16:44 ` Marcin Borkowski
@ 2017-02-26 16:57   ` Eli Zaretskii
  2017-02-26 18:48     ` Marcin Borkowski
  2017-03-07 16:47   ` Eli Zaretskii
  2017-03-09 21:04   ` Alan Mackenzie
  2 siblings, 1 reply; 11+ messages in thread
From: Eli Zaretskii @ 2017-02-26 16:57 UTC (permalink / raw)
  To: Marcin Borkowski; +Cc: acm, 19873

> From: Marcin Borkowski <mbork@amu.edu.pl>
> Date: Sun, 26 Feb 2017 17:44:51 +0100
> Cc: 19873@debbugs.gnu.org
> 
> First of all, my Emacs has this as paragraph-start:
> 
> "\f\\|[ 	]*$"
> 
> and this as paragraph-separate:
> 
> "[ 	\f]*$"
> 
> and frankly speaking, I'm not sure why they differ at all (by default).

I believe this is explained in the Emacs manual, node "Paragraphs".
In a nutshell, these two regexps have different purposes.





^ permalink raw reply	[flat|nested] 11+ messages in thread

* bug#19873: Ill-formed regular expression is constructed in forward-paragraph.
  2017-02-26 16:57   ` Eli Zaretskii
@ 2017-02-26 18:48     ` Marcin Borkowski
  0 siblings, 0 replies; 11+ messages in thread
From: Marcin Borkowski @ 2017-02-26 18:48 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: acm, 19873


On 2017-02-26, at 17:57, Eli Zaretskii <eliz@gnu.org> wrote:

>> From: Marcin Borkowski <mbork@amu.edu.pl>
>> Date: Sun, 26 Feb 2017 17:44:51 +0100
>> Cc: 19873@debbugs.gnu.org
>> 
>> First of all, my Emacs has this as paragraph-start:
>> 
>> "\f\\|[ 	]*$"
>> 
>> and this as paragraph-separate:
>> 
>> "[ 	\f]*$"
>> 
>> and frankly speaking, I'm not sure why they differ at all (by default).
>
> I believe this is explained in the Emacs manual, node "Paragraphs".
> In a nutshell, these two regexps have different purposes.

My bad, you're right.  Still, this has little to do with the bug itself.

Best,

-- 
Marcin Borkowski
http://octd.wmi.amu.edu.pl/en/Marcin_Borkowski
Faculty of Mathematics and Computer Science
Adam Mickiewicz University





^ permalink raw reply	[flat|nested] 11+ messages in thread

* bug#19873: Ill-formed regular expression is constructed in forward-paragraph.
  2017-02-26 16:44 ` Marcin Borkowski
  2017-02-26 16:57   ` Eli Zaretskii
@ 2017-03-07 16:47   ` Eli Zaretskii
  2017-03-09 21:04   ` Alan Mackenzie
  2 siblings, 0 replies; 11+ messages in thread
From: Eli Zaretskii @ 2017-03-07 16:47 UTC (permalink / raw)
  To: Marcin Borkowski; +Cc: acm, 19873

> From: Marcin Borkowski <mbork@amu.edu.pl>
> Date: Sun, 26 Feb 2017 17:44:51 +0100
> Cc: 19873@debbugs.gnu.org
> 
> My first thought is to add a check whether paragraph-start and
> paragraph-sep match something like
> 
> "^\\^?\\[[[:space:]]+\\][+*]?"
> 
> and if yes, make parstart/parsep equal to them, but without the matching
> part.
> 
> WDYT?

Can you suggest a patch along these lines?

Alan, any comments?

Thanks.





^ permalink raw reply	[flat|nested] 11+ messages in thread

* bug#19873: Ill-formed regular expression is constructed in forward-paragraph.
  2017-02-26 16:44 ` Marcin Borkowski
  2017-02-26 16:57   ` Eli Zaretskii
  2017-03-07 16:47   ` Eli Zaretskii
@ 2017-03-09 21:04   ` Alan Mackenzie
  2021-12-02 10:39     ` Lars Ingebrigtsen
  2 siblings, 1 reply; 11+ messages in thread
From: Alan Mackenzie @ 2017-03-09 21:04 UTC (permalink / raw)
  To: Marcin Borkowski; +Cc: 19873

Hello, Marcin.

On Sun, Feb 26, 2017 at 17:44:51 +0100, Marcin Borkowski wrote:
> On 2015-02-15, at 10:31, Alan Mackenzie <acm@muc.de> wrote:

> > Hello, Emacs!

> > In forward-paragraph, L37, a regular expression is constructed as
> > follows:

> > (let* ...
> >  (sp-parstart (concat "^[ \t]*\\(?:" parstart "\\|" parsep "\\)"))
> >  ...)

> > .  Here parstart and parsep are, more or less,
> > paragraph-{start,separate}.

> > The problem is that parstart and parsep themselves are likely to begin
> > with "[ \t]*" (the default values certainly do), so we have two
> > consecutive matchers for an arbitrary amount of whitespace.  This causes
> > the regexp engine to run very slowly when a line starts with lots of WS
> > but doesn't match.

> > This problem seems to be the cause of bug # 19846 (where holding down the
> > spacebar inside a C comment causes Emacs to seize up when auto-fill mode
> > is enabled).

> Hi Alan, hi all,

> I put this bug on my todo-list some time ago and decided now to revisit
> it.

> I'm wondering what could be done about it.  First of all, my Emacs has
> this as paragraph-start:

> "\f\\|[ 	]*$"

> and this as paragraph-separate:

> "[ 	\f]*$"

> and frankly speaking, I'm not sure why they differ at all (by default).
> Also, even though forward-paragraph checks for "^" at their beginning,
> they actually don't begin with that character (again, by default).

> My first thought is to add a check whether paragraph-start and
> paragraph-sep match something like

> "^\\^?\\[[[:space:]]+\\][+*]?"

> and if yes, make parstart/parsep equal to them, but without the matching
> part.

> WDYT?

My first reaction is "This is a good idea, but be very careful!".  For
example, if paragraph-start and/or paragraph-separate begin with
"[ \t]+" (i.e. the paragraph start requires space at BOL), you will miss
it by removing matches of "^\\^?\\[[[:space:]]+\\][+*]?" from them.

I think this idea is workable, but you'll have to check for one or both
of paragraph-s{tart,eparate} starting with "[ \t]+".  A good strategy
here might be to begin the target regexp with "^[ \t]*", then begin one
or both components with "[ \t]" (without the "*").

There may be other gotchas which I haven't thought about yet.

One needs a twisted mind to do this sort of thing properly, so I offer my
services to review your upcoming patch.  ;-)

> -- 
> Marcin Borkowski
> http://octd.wmi.amu.edu.pl/en/Marcin_Borkowski
> Faculty of Mathematics and Computer Science
> Adam Mickiewicz University

-- 
Alan Mackenzie (Nuremberg, Germany).





^ permalink raw reply	[flat|nested] 11+ messages in thread

* bug#19873: Ill-formed regular expression is constructed in forward-paragraph.
  2017-03-09 21:04   ` Alan Mackenzie
@ 2021-12-02 10:39     ` Lars Ingebrigtsen
  2021-12-02 10:44       ` Lars Ingebrigtsen
  2021-12-02 20:45       ` Alan Mackenzie
  0 siblings, 2 replies; 11+ messages in thread
From: Lars Ingebrigtsen @ 2021-12-02 10:39 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: Marcin Borkowski, 19873

Alan Mackenzie <acm@muc.de> writes:

> I think this idea is workable, but you'll have to check for one or both
> of paragraph-s{tart,eparate} starting with "[ \t]+".  A good strategy
> here might be to begin the target regexp with "^[ \t]*", then begin one
> or both components with "[ \t]" (without the "*").
>
> There may be other gotchas which I haven't thought about yet.
>
> One needs a twisted mind to do this sort of thing properly, so I offer my
> services to review your upcoming patch.  ;-)

The problem seems rather intractable to me.  Is there really any way to
examine a regexp to determine "does this in practice match [ \t]*"?

I wonder whether instead of trying to construct a better overall regexp
could rewrite the loop.  That is, instead of searching for sp-parstart,
search for parstart "\\|" parsep, and then check whether
(match-beginning 0) of that comes after "^[ \t]*".  Or something along
those lines.

But I don't know whether that'd be any faster in practice.

Do you have a test case that demonstrates the slowness?  In that case I
could try to see whether there's any alternate approach here that's
faster.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

^ permalink raw reply	[flat|nested] 11+ messages in thread

* bug#19873: Ill-formed regular expression is constructed in forward-paragraph.
  2021-12-02 10:39     ` Lars Ingebrigtsen
@ 2021-12-02 10:44       ` Lars Ingebrigtsen
  2021-12-02 11:17         ` Lars Ingebrigtsen
  2021-12-02 20:45       ` Alan Mackenzie
  1 sibling, 1 reply; 11+ messages in thread
From: Lars Ingebrigtsen @ 2021-12-02 10:44 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: Marcin Borkowski, 19873

Lars Ingebrigtsen <larsi@gnus.org> writes:

> Do you have a test case that demonstrates the slowness?  In that case I
> could try to see whether there's any alternate approach here that's
> faster.

Ah, there was a reproducer in bug#19846 (and it still reproduces).

I'll poke around here, then.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no





^ permalink raw reply	[flat|nested] 11+ messages in thread

* bug#19873: Ill-formed regular expression is constructed in forward-paragraph.
  2021-12-02 10:44       ` Lars Ingebrigtsen
@ 2021-12-02 11:17         ` Lars Ingebrigtsen
  0 siblings, 0 replies; 11+ messages in thread
From: Lars Ingebrigtsen @ 2021-12-02 11:17 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: Marcin Borkowski, 19873

Lars Ingebrigtsen <larsi@gnus.org> writes:

> Ah, there was a reproducer in bug#19846 (and it still reproduces).
>
> I'll poke around here, then.

The following is about 100x faster in the test case, and doesn't seem to
lead to any obvious regressions.

But it's kinda ugly.

Any comments?

An alternate approach would be to just go line by line, and see whether
the regexp matches at the start of the line (either before or after
skipping past any leading whitespace).  But I'm not sure that's any
prettier.

diff --git a/lisp/textmodes/paragraphs.el b/lisp/textmodes/paragraphs.el
index 59b15e82a8..e0d633d49b 100644
--- a/lisp/textmodes/paragraphs.el
+++ b/lisp/textmodes/paragraphs.el
@@ -238,7 +238,7 @@ forward-paragraph
 		      fill-prefix-regexp "[ \t]*$")
 	    parsep))
 	 ;; This is used for searching.
-	 (sp-parstart (concat "^[ \t]*\\(?:" parstart "\\|" parsep "\\)"))
+	 (sp-parstart (concat "\\(?:" parstart "\\|" parsep "\\)"))
 	 start found-start)
     (while (and (< arg 0) (not (bobp)))
       (if (and (not (looking-at parsep))
@@ -278,7 +278,7 @@ forward-paragraph
 		    ;;      multiple-lines
 		    ;;      (forward-line 1))
 		    (not (bobp)))
-		(while (and (re-search-backward sp-parstart nil 1)
+		(while (and (paragraph--line-search nil sp-parstart)
 			    (setq found-start t)
 			    ;; Found a candidate, but need to check if it is a
 			    ;; REAL parstart.
@@ -328,7 +328,7 @@ forward-paragraph
 		      (not (looking-at parsep))
 		      (looking-at fill-prefix-regexp))
 	    (forward-line 1))
-	(while (and (re-search-forward sp-parstart nil 1)
+	(while (and (paragraph--line-search t sp-parstart)
 		    (progn (setq start (match-beginning 0))
 			   (goto-char start)
 			   (not (eobp)))
@@ -344,6 +344,34 @@ forward-paragraph
     ;; Return the number of steps that could not be done.
     arg))
 
+;; We do it this way because the regexp commonly starts with optional
+;; whitespace.
+(defun paragraph--line-search (forward regexp)
+  "Look for REGEXP starting on a line.
+If FORWARD, search forward.  If not, go backward."
+  (catch 'found
+    (while (and (or (and forward
+                         (not (eobp)))
+                    (and (not forward)
+                         (not (bobp))))
+                (funcall (if forward
+                             #'re-search-forward
+                           #'re-search-backward)
+                         regexp nil 1))
+      (save-excursion
+        (goto-char (match-beginning 0))
+        (beginning-of-line)
+        (save-restriction
+          (narrow-to-region
+           (point) (match-beginning 0))
+          (when (looking-at-p "[ \t]*$")
+            (throw 'found t))))
+      (if forward
+          (unless (eobp)
+            (forward-char 1))
+        (unless (bobp)
+          (backward-char 1))))))
+
 (defun backward-paragraph (&optional arg)
   "Move backward to start of paragraph.
 With argument ARG, do it ARG times;


-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no





^ permalink raw reply related	[flat|nested] 11+ messages in thread

* bug#19873: Ill-formed regular expression is constructed in forward-paragraph.
  2021-12-02 10:39     ` Lars Ingebrigtsen
  2021-12-02 10:44       ` Lars Ingebrigtsen
@ 2021-12-02 20:45       ` Alan Mackenzie
  2021-12-03 16:15         ` Lars Ingebrigtsen
  1 sibling, 1 reply; 11+ messages in thread
From: Alan Mackenzie @ 2021-12-02 20:45 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: Marcin Borkowski, 19873

Hello, Lars.

On Thu, Dec 02, 2021 at 11:39:51 +0100, Lars Ingebrigtsen wrote:
> Alan Mackenzie <acm@muc.de> writes:

> > I think this idea is workable, but you'll have to check for one or both
> > of paragraph-s{tart,eparate} starting with "[ \t]+".  A good strategy
> > here might be to begin the target regexp with "^[ \t]*", then begin one
> > or both components with "[ \t]" (without the "*").

> > There may be other gotchas which I haven't thought about yet.

> > One needs a twisted mind to do this sort of thing properly, so I offer my
> > services to review your upcoming patch.  ;-)

> The problem seems rather intractable to me.  Is there really any way to
> examine a regexp to determine "does this in practice match [ \t]*"?

Back when the bug was new, I started writing a library to analyse a
regular expression and convert it into an equivalent well formed regular
expression.  It's actually working, but is incomplete.  It's currently
2757 lines long, including pretty complete unit testing.  I actually
looked at it again at the start of November.

> I wonder whether instead of trying to construct a better overall regexp
> could rewrite the loop.  That is, instead of searching for sp-parstart,
> search for parstart "\\|" parsep, and then check whether
> (match-beginning 0) of that comes after "^[ \t]*".  Or something along
> those lines.

> But I don't know whether that'd be any faster in practice.

It strikes me as one of these things which needs to be done
systematically, which, as I said, I've already tried (and not yet given
up).  The question presents itself, would the effort be better spent
improving Emacs's regexp engine?

> Do you have a test case that demonstrates the slowness?  In that case I
> could try to see whether there's any alternate approach here that's
> faster.

Martin Rudalics had the original testcase.  The slowness was exponential
with the number of spaces typed, I think.

> -- 
> (domestic pets only, the antidote for overdose, milk.)
>    bloggy blog: http://lars.ingebrigtsen.no

-- 
Alan Mackenzie (Nuremberg, Germany).





^ permalink raw reply	[flat|nested] 11+ messages in thread

* bug#19873: Ill-formed regular expression is constructed in forward-paragraph.
  2021-12-02 20:45       ` Alan Mackenzie
@ 2021-12-03 16:15         ` Lars Ingebrigtsen
  0 siblings, 0 replies; 11+ messages in thread
From: Lars Ingebrigtsen @ 2021-12-03 16:15 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: Marcin Borkowski, 19873

Alan Mackenzie <acm@muc.de> writes:

> Back when the bug was new, I started writing a library to analyse a
> regular expression and convert it into an equivalent well formed regular
> expression.  It's actually working, but is incomplete.  It's currently
> 2757 lines long, including pretty complete unit testing.  I actually
> looked at it again at the start of November.

Interesting.

> It strikes me as one of these things which needs to be done
> systematically, which, as I said, I've already tried (and not yet given
> up).  The question presents itself, would the effort be better spent
> improving Emacs's regexp engine?

Indeed -- making the Emacs regexp engine transform these complex regexps
into simpler, equivalent forms would be great.  We wouldn't have to use
it on all regexps -- the problem usually rears its head when we're
combining several user-defined regexps into a large one, so if we had a
`simplify-regexp' function that we could stick in here, that'd solve the
issue here (and in similar circumstances elsewhere).

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no





^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2021-12-03 16:15 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-02-15 10:31 bug#19873: Ill-formed regular expression is constructed in forward-paragraph Alan Mackenzie
2017-02-26 16:44 ` Marcin Borkowski
2017-02-26 16:57   ` Eli Zaretskii
2017-02-26 18:48     ` Marcin Borkowski
2017-03-07 16:47   ` Eli Zaretskii
2017-03-09 21:04   ` Alan Mackenzie
2021-12-02 10:39     ` Lars Ingebrigtsen
2021-12-02 10:44       ` Lars Ingebrigtsen
2021-12-02 11:17         ` Lars Ingebrigtsen
2021-12-02 20:45       ` Alan Mackenzie
2021-12-03 16:15         ` Lars Ingebrigtsen

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).