unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed
From: Gregory Heytings <gregory@heytings.org>
To: Stefan Monnier <monnier@iro.umontreal.ca>
Cc: Eli Zaretskii <eliz@gnu.org>, 61514@debbugs.gnu.org, mah@everybody.org
Subject: bug#61514: 30.0.50; sadistically long xml line hangs emacs
Date: Mon, 20 Feb 2023 11:28:44 +0000	[thread overview]
Message-ID: <a32486a84ddfd2f1989c@heytings.org> (raw)
In-Reply-To: <jwv5ybxou1d.fsf-monnier+emacs@gnu.org>

[-- Attachment #1: Type: text/plain, Size: 2385 bytes --]


>> Looking at the history of that variable, which is in fact a 
>> compile-time constant, I see that it was initially (May 1995) set to 
>> 200000.  A few months later (Nov 1995) it was set to 20000, and reduced 
>> again (apparently because of bug reports) to 8000 and to 4000 (both in 
>> Jun 1996).  Two months later it was again set to 20000 (Aug 1996), and 
>> a year later to 40000 (Dec 1997).  It kept that value since then.  As 
>> these changes (and this bug report) demonstrate, it is not possible to 
>> give that variable a "one size fits all" value.
>
> Note that the stack is allocated with `SAFE_ALLOCA` and used to be 
> allocated with just `alloca`.  So the constant was probably reduced 
> (back in the 90s) in response to reports of segfaults due to C stack 
> overflows.
>

Indeed.  But now that we use SAFE_ALLOCA, we fallback to malloc when there 
is not enough room for an alloca, so the constant seems even more 
arbitrary.

>
> Nowadays we should be hopefully(?) safe from such segfaults since 
> `SAFE_ALLOCA` only uses `alloca` for smallish allocations.
>

That's not the case in regex-emacs.c: REGEX_USE_SAFE_ALLOCA sets sa_avail 
to emacs_re_safe_alloca (~6 MiB) instead of its default MAX_ALLOCA value 
(16 KiB).

>
> This really needs a comment (at least one referring to this bug report). 
> I think the idea is that we hope the regexp will need at most one stack 
> entry per character, so the above means that we're willing to limit the 
> regexp search to about 1kB of text, which sounds fair given it's 
> supposed to match just a single XML attribute.
>

Indeed, thanks!

>> +  DEFVAR_INT ("regexp-max-failures", Vregexp_max_failures,
>> +	      doc: /* Maximum number of failures points in a regexp search.  */);
>> +  Vregexp_max_failures = max_regexp_max_failures;
>
> This name is misleading.  It suggests it's talking about how many times 
> we fail, whereas the reality is that it's about the number of pending 
> branches in the search space (which the source code calls "failure 
> points" because it's info to be used in case the current branch fails to 
> match).  It could also be described as the number of "pending 
> continuations" or "stacked failure continuations" or some wording like 
> that.
>
> But for the var name itself, how 'bout `regexp-max-backtracking-depth`?
>

Indeed again, and thanks again!

Updated patch attached.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: Make-the-backtracking-depth-of-regexp-searches-modif.patch --]
[-- Type: text/x-diff; name=Make-the-backtracking-depth-of-regexp-searches-modif.patch, Size: 7625 bytes --]

From 4523387ac645d8d6daab07114e29d9386a02450a Mon Sep 17 00:00:00 2001
From: Gregory Heytings <gregory@heytings.org>
Date: Mon, 20 Feb 2023 11:18:30 +0000
Subject: [PATCH] Make the backtracking depth of regexp searches modifiable

* src/search.c (syms_of_search) <regexp-max-backtracking-depth>:
New variable, replacing the constant variable
'emacs_re_max_failures'.  Initialize it with the constant
'max_regexp_max_backtracking_depth'.

* src/regex-emacs.h: Replace the external definition of
'emacs_re_max_failures' with the constant
'max_regexp_max_backtracking_depth'.

* src/regex-emacs.c (GROW_FAIL_STACK): Use the new variable
instead of the constant.  Reset it to its maximum value when
necessary.

* src/emacs.c (main): Use the new constant
'max_regexp_max_backtracking_depth' in the calculations.

* lisp/nxml/xmltok.el (xmltok-scan-attributes): Bind
'regexp-max-backtracking-depth' to a small value, and add a
comment.  Fixes bug#61514.
---
 lisp/nxml/xmltok.el |  7 ++++++-
 src/emacs.c         |  8 ++++----
 src/regex-emacs.c   | 26 +++++++++++---------------
 src/regex-emacs.h   |  7 +++++--
 src/search.c        | 13 +++++++++++++
 5 files changed, 39 insertions(+), 22 deletions(-)

diff --git a/lisp/nxml/xmltok.el b/lisp/nxml/xmltok.el
index c36d225c7c9..8201b773e0f 100644
--- a/lisp/nxml/xmltok.el
+++ b/lisp/nxml/xmltok.el
@@ -731,7 +731,12 @@ xmltok-scan-after-comment-open
 
 (defun xmltok-scan-attributes ()
   (let ((recovering nil)
-	(atts-needing-normalization nil))
+	(atts-needing-normalization nil)
+        ;; Limit the backtracking depth of regexp searches, to fail
+        ;; with a "Stack overflow in regexp matcher" error instead of
+        ;; inflooping in looking-at.  This assumes that XML attributes
+        ;; do not use more than about 1 KB characters.  See bug#61514.
+	(regexp-max-backtracking-depth 1000))
     (while (cond ((or (looking-at (xmltok-attribute regexp))
 		      ;; use non-greedy group
 		      (when (looking-at (concat "[^<>\n]+?"
diff --git a/src/emacs.c b/src/emacs.c
index 214e2e2a296..d0dca3f03ec 100644
--- a/src/emacs.c
+++ b/src/emacs.c
@@ -1499,9 +1499,9 @@ main (int argc, char **argv)
       rlim_t lim = rlim.rlim_cur;
 
       /* Approximate the amount regex-emacs.c needs per unit of
-	 emacs_re_max_failures, then add 33% to cover the size of the
-	 smaller stacks that regex-emacs.c successively allocates and
-	 discards on its way to the maximum.  */
+	 max_regexp_max_backtracking_depth, then add 33% to cover the
+	 size of the smaller stacks that regex-emacs.c successively
+	 allocates and discards on its way to the maximum.  */
       int min_ratio = 20 * sizeof (char *);
       int ratio = min_ratio + min_ratio / 3;
 
@@ -1514,7 +1514,7 @@ main (int argc, char **argv)
 
       if (try_to_grow_stack)
 	{
-	  rlim_t newlim = emacs_re_max_failures * ratio + extra;
+	  rlim_t newlim = max_regexp_max_backtracking_depth * ratio + extra;
 
 	  /* Round the new limit to a page boundary; this is needed
 	     for Darwin kernel 15.4.0 (see Bug#23622) and perhaps
diff --git a/src/regex-emacs.c b/src/regex-emacs.c
index 2dca0d16ad9..931db980e39 100644
--- a/src/regex-emacs.c
+++ b/src/regex-emacs.c
@@ -868,17 +868,6 @@ print_double_string (re_char *where, re_char *string1, ptrdiff_t size1,
    space, so it is not a hard limit.  */
 #define INIT_FAILURE_ALLOC 20
 
-/* Roughly the maximum number of failure points on the stack.  Would be
-   exactly that if failure always used TYPICAL_FAILURE_SIZE items.
-   This is a variable only so users of regex can assign to it; we never
-   change it ourselves.  We always multiply it by TYPICAL_FAILURE_SIZE
-   before using it, so it should probably be a byte-count instead.  */
-/* Note that 4400 was enough to cause a crash on Alpha OSF/1,
-   whose default stack limit is 2mb.  In order for a larger
-   value to work reliably, you have to try to make it accord
-   with the process stack limit.  */
-ptrdiff_t emacs_re_max_failures = 40000;
-
 union fail_stack_elt
 {
   re_char *pointer;
@@ -912,7 +901,7 @@ #define INIT_FAIL_STACK()						\
 
 
 /* Double the size of FAIL_STACK, up to a limit
-   which allows approximately 'emacs_re_max_failures' items.
+   which allows approximately 'Vregexp_max_backtracking_depth' items.
 
    Return 1 if succeeds, and 0 if either ran out of memory
    allocating space for it or it was already too large.
@@ -926,16 +915,23 @@ #define INIT_FAIL_STACK()						\
 #define FAIL_STACK_GROWTH_FACTOR 4
 
 #define GROW_FAIL_STACK(fail_stack)					\
-  (((fail_stack).size >= emacs_re_max_failures * TYPICAL_FAILURE_SIZE)        \
+  ((Vregexp_max_backtracking_depth =					\
+    Vregexp_max_backtracking_depth <= 0					\
+    || Vregexp_max_backtracking_depth					\
+       > max_regexp_max_backtracking_depth				\
+    ? max_regexp_max_backtracking_depth					\
+    : Vregexp_max_backtracking_depth),					\
+   ((fail_stack).size							\
+    >= Vregexp_max_backtracking_depth * TYPICAL_FAILURE_SIZE)		\
    ? 0									\
    : ((fail_stack).stack						\
       = REGEX_REALLOCATE ((fail_stack).stack,				\
 	  (fail_stack).avail * sizeof (fail_stack_elt_t),		\
-          min (emacs_re_max_failures * TYPICAL_FAILURE_SIZE,                  \
+          min (Vregexp_max_backtracking_depth * TYPICAL_FAILURE_SIZE,   \
                ((fail_stack).size * FAIL_STACK_GROWTH_FACTOR))          \
           * sizeof (fail_stack_elt_t)),                                 \
       ((fail_stack).size						\
-       = (min (emacs_re_max_failures * TYPICAL_FAILURE_SIZE,		\
+       = (min (Vregexp_max_backtracking_depth * TYPICAL_FAILURE_SIZE,	\
 	       ((fail_stack).size * FAIL_STACK_GROWTH_FACTOR)))),	\
       1))
 
diff --git a/src/regex-emacs.h b/src/regex-emacs.h
index 1bc973363e9..9ccc4177487 100644
--- a/src/regex-emacs.h
+++ b/src/regex-emacs.h
@@ -49,8 +49,11 @@ #define EMACS_REGEX_H 1
    TODO: turn into an actual function parameter.  */
 extern Lisp_Object re_match_object;
 
-/* Roughly the maximum number of failure points on the stack.  */
-extern ptrdiff_t emacs_re_max_failures;
+/* Maximum value for Vregexp_max_backtracking_depth.  This is roughly
+   the maximum allowed number of failure points on the stack.  It
+   would be exactly that if failure always used TYPICAL_FAILURE_SIZE
+   items.  */
+#define max_regexp_max_backtracking_depth 40000
 
 /* Amount of memory that we can safely stack allocate.  */
 extern ptrdiff_t emacs_re_safe_alloca;
diff --git a/src/search.c b/src/search.c
index 0bb52c03eef..fc5d7c2b8e2 100644
--- a/src/search.c
+++ b/src/search.c
@@ -3431,6 +3431,19 @@ syms_of_search (void)
 is to bind it with `let' around a small expression.  */);
   Vinhibit_changing_match_data = Qnil;
 
+  DEFVAR_INT ("regexp-max-backtracking-depth", Vregexp_max_backtracking_depth,
+	      doc: /* Maximum backtracking depth in a regexp search.
+
+When the number of pending branches in the search space reaches that
+threshold, a regexp search fails with a "Stack overflow in regexp
+matcher".  Roughly speaking, this is the number of characters to which
+a regexp search is limited, with a complex enough regexp.
+
+Note that this variable will be reset to its default value if it is
+set to a non-positive value, or to a higher value than its default
+value.  */);
+  Vregexp_max_backtracking_depth = max_regexp_max_backtracking_depth;
+
   defsubr (&Slooking_at);
   defsubr (&Sposix_looking_at);
   defsubr (&Sstring_match);
-- 
2.39.0


  reply	other threads:[~2023-02-20 11:28 UTC|newest]

Thread overview: 75+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-02-14 21:02 bug#61514: 30.0.50; sadistically long xml line hangs emacs Mark A. Hershberger via Bug reports for GNU Emacs, the Swiss army knife of text editors
2023-02-14 22:05 ` Gregory Heytings
2023-02-15  1:04   ` Mark A. Hershberger
2023-02-15  8:39     ` Gregory Heytings
2023-02-15 10:24       ` Po Lu via Bug reports for GNU Emacs, the Swiss army knife of text editors
2023-02-15 10:41         ` Gregory Heytings
2023-02-15 10:52           ` Po Lu via Bug reports for GNU Emacs, the Swiss army knife of text editors
2023-02-15 10:59             ` Gregory Heytings
2023-02-15 11:52               ` Po Lu via Bug reports for GNU Emacs, the Swiss army knife of text editors
2023-02-15 12:11                 ` Gregory Heytings
2023-02-15 12:54                   ` Po Lu via Bug reports for GNU Emacs, the Swiss army knife of text editors
2023-02-15 13:31                     ` Gregory Heytings
2023-02-15 13:56                 ` Eli Zaretskii
2023-02-15 12:20       ` Dmitry Gutov
2023-02-15 13:58         ` Gregory Heytings
2023-02-15 14:17           ` Eli Zaretskii
2023-02-15 14:34             ` Gregory Heytings
2023-02-18 16:22 ` Eli Zaretskii
2023-02-18 17:06   ` Mark A. Hershberger
2023-02-18 17:58     ` Eli Zaretskii
2023-02-18 23:06   ` Gregory Heytings
2023-02-19  0:46     ` Gregory Heytings
2023-02-19  6:42       ` Eli Zaretskii
2023-02-19 23:12         ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2023-02-19 23:48         ` Gregory Heytings
2023-02-19 23:58           ` Gregory Heytings
2023-02-20  2:05             ` Gregory Heytings
2023-02-20  4:24               ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2023-02-20 11:28                 ` Gregory Heytings [this message]
2023-02-20 12:33               ` Eli Zaretskii
2023-02-20 12:31             ` Eli Zaretskii
2023-02-20 12:40               ` Gregory Heytings
2023-02-20 13:14                 ` Eli Zaretskii
2023-02-20 14:17                   ` Gregory Heytings
2023-02-20  0:14           ` Gregory Heytings
2023-02-20 12:32             ` Eli Zaretskii
2023-02-19 23:48   ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2023-02-20 12:19     ` Eli Zaretskii
2023-02-20 13:19       ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2023-02-20 13:54         ` Eli Zaretskii
2023-02-20 14:59           ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2023-02-20 15:56             ` Gregory Heytings
2023-02-20 16:47               ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2023-02-20 17:14                 ` Gregory Heytings
2023-02-20 17:34                   ` Gregory Heytings
2023-02-20 18:49                 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2023-02-20 19:11                   ` Gregory Heytings
2023-02-20 19:29                     ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2023-02-20 19:37                     ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2023-02-20 20:13                       ` Gregory Heytings
2023-02-21 12:05                         ` Eli Zaretskii
2023-02-21 12:37                           ` Gregory Heytings
2023-02-21 13:07                             ` Eli Zaretskii
2023-02-21 14:38                               ` Gregory Heytings
2023-02-21 14:48                                 ` Eli Zaretskii
2023-02-21 15:25                                   ` Gregory Heytings
2023-02-21 15:44                                     ` Gregory Heytings
2023-02-21 16:58                                       ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2023-03-18 10:59                                         ` Gregory Heytings
2023-03-18 11:10                                           ` Eli Zaretskii
2023-03-18 15:06                                             ` Gregory Heytings
2023-03-19  2:39                                           ` mah via Bug reports for GNU Emacs, the Swiss army knife of text editors
2023-02-21 13:24                             ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2023-02-21 13:35                               ` Gregory Heytings
2023-02-20 20:01                   ` Eli Zaretskii
2023-02-21  2:23                     ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2023-02-21  9:39                       ` Gregory Heytings
2023-02-21 12:44                         ` Eli Zaretskii
2023-02-20 17:04               ` Gregory Heytings
2023-02-20 14:06         ` Gregory Heytings
2023-02-20 14:16           ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2023-02-20 14:24             ` Gregory Heytings
2023-02-20 15:02               ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2023-02-19 23:38 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2023-02-20 12:41   ` Eli Zaretskii

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/emacs/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=a32486a84ddfd2f1989c@heytings.org \
    --to=gregory@heytings.org \
    --cc=61514@debbugs.gnu.org \
    --cc=eliz@gnu.org \
    --cc=mah@everybody.org \
    --cc=monnier@iro.umontreal.ca \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).