unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed
From: Noam Postavsky <npostavs@users.sourceforge.net>
To: Eli Zaretskii <eliz@gnu.org>
Cc: 24914@debbugs.gnu.org
Subject: bug#24914: 24.5; isearch-regexp: wrong error message
Date: Sat, 09 Dec 2017 21:18:05 -0500	[thread overview]
Message-ID: <87vahfe36q.fsf@users.sourceforge.net> (raw)
In-Reply-To: <83bmj9wan8.fsf@gnu.org> (Eli Zaretskii's message of "Fri, 08 Dec 2017 16:35:07 +0200")

[-- Attachment #1: Type: text/plain, Size: 2773 bytes --]

Eli Zaretskii <eliz@gnu.org> writes:

>> I thought it would be easier to document the limit if it's fixed across
>> all machines.  Otherwise we would have to say something like "For both
>> forms, m and n, if specified, may be no larger than INT_MAX, which is
>> usually 2**31 - 1, but could be 2**63 - 1 depending on the compiler used
>> for building Emacs".
>
> Isn't int 32 bit wide everywhere?

I might have been mixing up int with long when I was thinking about
this; it seems only a few very obscure platforms have 64 bit ints.
According to [1], everywhere but "HAL Computer Systems port of Solaris
to the SPARC64" and "Classic UNICOS" has 32 bit ints.

[1]: https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_models

> And anyway, since the bitmap is stored in an int, isn't INT_MAX TRT?

Unfortunately, all this discussion of int size seems to be academic.  I
took another look at the code, there is another limit due to regexp
opcode format.  We can raise the limit to 2^16-1 though.

Here is the use of RE_DUP_MAX, which makes it seem like int-size is the
main limit:

    /* Get the next unsigned number in the uncompiled pattern.  */
    #define GET_INTERVAL_COUNT(num)					\
      ...
                if (RE_DUP_MAX / 10 - (RE_DUP_MAX % 10 < c - '0') < num)	\
                  FREE_STACK_RETURN (REG_ESIZEBR);				\


    static reg_errcode_t
    regex_compile (const_re_char *pattern, size_t size,
    {
      ...
		int lower_bound = 0, upper_bound = -1;
                [...]
		GET_INTERVAL_COUNT (lower_bound);

But then

			INSERT_JUMP2 (succeed_n, laststart,
				      b + 5 + nbytes,
				      lower_bound);

    /* Like `STORE_JUMP2', but for inserting.  Assume `b' is the buffer end.  */
    #define INSERT_JUMP2(op, loc, to, arg) \
      insert_op2 (op, loc, (to) - (loc) - 3, arg, b)

    /* Like `insert_op1', but for two two-byte parameters ARG1 and ARG2.  */
                                      ^^^^^^^^
    static void
    insert_op2 (re_opcode_t op, unsigned char *loc, int arg1, int arg2, unsigned char *end)
    {
      ...
      store_op2 (op, loc, arg1, arg2);
    }

    /* Like `store_op1', but for two two-byte parameters ARG1 and ARG2.  */
                                     ^^^^^^^^
    static void
    store_op2 (re_opcode_t op, unsigned char *loc, int arg1, int arg2)
    {
      *loc = (unsigned char) op;
      STORE_NUMBER (loc + 1, arg1);
      STORE_NUMBER (loc + 3, arg2);
    }

    /* Store NUMBER in two contiguous bytes starting at DESTINATION.  */
                       ^^^^^^^^^^^^^^^^^^^^

    #define STORE_NUMBER(destination, number)				\
      do {									\
        (destination)[0] = (number) & 0377;					\
        (destination)[1] = (number) >> 8;					\
      } while (0)


Here is the updated patch:


[-- Attachment #2: patch --]
[-- Type: text/plain, Size: 6410 bytes --]

From 6c3ead6bd5c61801915dcedbb8dd17622610a899 Mon Sep 17 00:00:00 2001
From: Noam Postavsky <npostavs@gmail.com>
Date: Sat, 2 Dec 2017 19:01:54 -0500
Subject: [PATCH] Raise limit of regexp repetition (Bug#24914)

* src/regex.h (RE_DUP_MAX): Raise limit to 2^16-1.
* etc/NEWS: Announce it.
* doc/lispref/searching.texi (Regexp Backslash): Document it.
* test/src/regex-tests.el (regex-repeat-limit): Test it.

* src/regex.h (reg_errcode_t): Add REG_ESIZEBR code.
* src/regex.c (re_error_msgid): Add corresponding entry.
(GET_INTERVAL_COUNT): Return it instead of the more generic REG_EBADBR
when encountering a repetition greater than RE_DUP_MAX.

* lisp/isearch.el (isearch-search): Don't convert errors starting with
"Invalid" into "incomplete".  Such errors are not incomplete, in the
sense that they cannot be corrected by appending more characters to
the end of the regexp.  The affected error messages are:

- REG_BADPAT "Invalid regular expression"
  - \\(?X:\\) where X is not a legal group number
  - \\_X where X is not < or >

- REG_ECOLLATE "Invalid collation character"
  - There is no code to throw this.

- REG_ECTYPE "Invalid character class name"
  - [[:foo:] where foo is not a valid class name

- REG_ESUBREG "Invalid back reference"
  - \N where N is referenced before matching group N

- REG_BADBR "Invalid content of \\{\\}"
  - \\{N,M\\} where N < 0, M < N, M or N larger than max
  - \\{NX where X is not a digit or backslash
  - \\{N\\X where X is not a }

- REG_ERANGE "Invalid range end"
  - There is no code to throw this.

- REG_BADRPT "Invalid preceding regular expression"
  - We never throw this.  It would usually indicate a "*" with no
    preceding regexp text, but Emacs allows that to match a literal
    "*".
---
 doc/lispref/searching.texi | 10 +++++++++-
 etc/NEWS                   |  8 ++++++++
 lisp/isearch.el            |  2 +-
 src/regex.c                |  5 +++--
 src/regex.h                |  9 ++++++---
 test/src/regex-tests.el    |  6 ++++++
 6 files changed, 33 insertions(+), 7 deletions(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index 755fa554bb..ab52cf2802 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -639,7 +639,15 @@ Regexp Backslash
 is a more general postfix operator that specifies repetition with a
 minimum of @var{m} repeats and a maximum of @var{n} repeats.  If @var{m}
 is omitted, the minimum is 0; if @var{n} is omitted, there is no
-maximum.
+maximum.  For both forms, @var{m} and @var{n}, if specified, may be no
+larger than
+@ifnottex
+2**16 @minus{} 1
+@end ifnottex
+@tex
+@math{2^{16}-1}
+@end tex
+.
 
 For example, @samp{c[ad]\@{1,2\@}r} matches the strings @samp{car},
 @samp{cdr}, @samp{caar}, @samp{cadr}, @samp{cdar}, and @samp{cddr}, and
diff --git a/etc/NEWS b/etc/NEWS
index 64b53d88c8..c7efc53f6a 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -509,6 +509,14 @@ instead.
 ** The new user option 'arabic-shaper-ZWNJ-handling' controls how to
 handle ZWNJ in Arabic text rendering.
 
++++
+** The limit on repetitions in regexps has been raised to 2^16-1.
+It was previously undocumented and limited to 2^15-1.  For example,
+the following regular expression was previously invalid, but is now
+accepted:
+
+   x\{32768\}
+
 \f
 * Editing Changes in Emacs 26.1
 
diff --git a/lisp/isearch.el b/lisp/isearch.el
index 13fa97ea71..093185a096 100644
--- a/lisp/isearch.el
+++ b/lisp/isearch.el
@@ -2851,7 +2851,7 @@ isearch-search
      (setq isearch-error (car (cdr lossage)))
      (cond
       ((string-match
-	"\\`Premature \\|\\`Unmatched \\|\\`Invalid "
+	"\\`Premature \\|\\`Unmatched "
 	isearch-error)
        (setq isearch-error "incomplete input"))
       ((and (not isearch-regexp)
diff --git a/src/regex.c b/src/regex.c
index 330f2f78a8..ab74f457d4 100644
--- a/src/regex.c
+++ b/src/regex.c
@@ -1200,7 +1200,8 @@ WEAK_ALIAS (__re_set_syntax, re_set_syntax)
     gettext_noop ("Premature end of regular expression"), /* REG_EEND */
     gettext_noop ("Regular expression too big"), /* REG_ESIZE */
     gettext_noop ("Unmatched ) or \\)"), /* REG_ERPAREN */
-    gettext_noop ("Range striding over charsets") /* REG_ERANGEX  */
+    gettext_noop ("Range striding over charsets"), /* REG_ERANGEX  */
+    gettext_noop ("Invalid content of \\{\\}, repetitions too big") /* REG_ESIZEBR  */
   };
 \f
 /* Whether to allocate memory during matching.  */
@@ -1921,7 +1922,7 @@ while (REMAINING_AVAIL_SLOTS <= space) {				\
 	    if (num < 0)						\
 	      num = 0;							\
 	    if (RE_DUP_MAX / 10 - (RE_DUP_MAX % 10 < c - '0') < num)	\
-	      FREE_STACK_RETURN (REG_BADBR);				\
+	      FREE_STACK_RETURN (REG_ESIZEBR);				\
 	    num = num * 10 + c - '0';					\
 	    if (p == pend)						\
 	      FREE_STACK_RETURN (REG_EBRACE);				\
diff --git a/src/regex.h b/src/regex.h
index 9fa8356011..4c8632d6aa 100644
--- a/src/regex.h
+++ b/src/regex.h
@@ -270,8 +270,10 @@
 #ifdef RE_DUP_MAX
 # undef RE_DUP_MAX
 #endif
-/* If sizeof(int) == 2, then ((1 << 15) - 1) overflows.  */
-#define RE_DUP_MAX (0x7fff)
+/* Repeat counts are stored in opcodes as 2 byte integers.  This was
+   previously limited to 7fff because the parsing code uses signed
+   ints.  But Emacs only runs on 32 bit platforms anyway.  */
+#define RE_DUP_MAX (0xffff)
 
 
 /* POSIX `cflags' bits (i.e., information for `regcomp').  */
@@ -337,7 +339,8 @@
   REG_EEND,		/* Premature end.  */
   REG_ESIZE,		/* Compiled pattern bigger than 2^16 bytes.  */
   REG_ERPAREN,		/* Unmatched ) or \); not returned from regcomp.  */
-  REG_ERANGEX		/* Range striding over charsets.  */
+  REG_ERANGEX,		/* Range striding over charsets.  */
+  REG_ESIZEBR           /* n or m too big in \{n,m\} */
 } reg_errcode_t;
 \f
 /* This data structure represents a compiled pattern.  Before calling
diff --git a/test/src/regex-tests.el b/test/src/regex-tests.el
index b1f1ea71ce..872d16a085 100644
--- a/test/src/regex-tests.el
+++ b/test/src/regex-tests.el
@@ -677,4 +677,10 @@ regex-tests-TESTS
 This evaluates the TESTS test cases from glibc."
   (should-not (regex-tests-TESTS)))
 
+(ert-deftest regex-repeat-limit ()
+  "Test the #xFFFF repeat limit."
+  (should (string-match "\\`x\\{65535\\}" (make-string 65535 ?x)))
+  (should-not (string-match "\\`x\\{65535\\}" (make-string 65534 ?x)))
+  (should-error (string-match "\\`x\\{65536\\}" "X") :type invalid-regexp))
+
 ;;; regex-tests.el ends here
-- 
2.11.0


  reply	other threads:[~2017-12-10  2:18 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-11-09 22:29 bug#24914: 24.5; isearch-regexp: wrong error message Drew Adams
2017-12-03 16:37 ` Noam Postavsky
2017-12-03 18:00   ` Drew Adams
2017-12-03 18:13     ` Noam Postavsky
2017-12-03 18:56       ` Drew Adams
2017-12-04  6:27         ` Noam Postavsky
2017-12-04 14:52           ` Drew Adams
2017-12-05  1:18             ` Noam Postavsky
2017-12-05  3:15               ` Drew Adams
2017-12-05  3:51                 ` Noam Postavsky
2017-12-05  4:52                   ` Drew Adams
2017-12-05 13:27                     ` Noam Postavsky
2017-12-05 15:31                       ` Drew Adams
2017-12-06  2:52                         ` Noam Postavsky
2017-12-08  9:48               ` Eli Zaretskii
2017-12-08 13:32                 ` Noam Postavsky
2017-12-08 14:35                   ` Eli Zaretskii
2017-12-10  2:18                     ` Noam Postavsky [this message]
2017-12-10  6:49                       ` Eli Zaretskii
2018-01-27  2:05                         ` Noam Postavsky
2017-12-04 15:18           ` Eli Zaretskii

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/emacs/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87vahfe36q.fsf@users.sourceforge.net \
    --to=npostavs@users.sourceforge.net \
    --cc=24914@debbugs.gnu.org \
    --cc=eliz@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).