bug#24914: 24.5; isearch-regexp: wrong error message

unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed

* bug#24914: 24.5; isearch-regexp: wrong error message
@ 2016-11-09 22:29 Drew Adams
  2017-12-03 16:37 ` Noam Postavsky
  0 siblings, 1 reply; 21+ messages in thread
From: Drew Adams @ 2016-11-09 22:29 UTC (permalink / raw)
  To: 24914

Do this in a large file:

1. C-M-s \(.\|^J\)+

That shows the error message: [error Stack overflow in regexp matcher].
OK, understandable.

2. C-M-s \(.\|^J\)\{,4000\}, where ^J is really a newline char, typed by
using `C-q C-j'.

No problem with that search.

3. C-M-s \(.\|^J\)\{,40000\}

That shows the error message: [incomplete input], which is wrong, IMO.

In GNU Emacs 24.5.1 (i686-pc-mingw32)
 of 2015-04-11 on LEG570
Windowing system distributor `Microsoft Corp.', version 6.1.7601
Configured using:
 `configure --prefix=/c/usr --host=i686-pc-mingw32'





^ permalink raw reply	[flat|nested] 21+ messages in thread

* bug#24914: 24.5; isearch-regexp: wrong error message
  2016-11-09 22:29 bug#24914: 24.5; isearch-regexp: wrong error message Drew Adams
@ 2017-12-03 16:37 ` Noam Postavsky
  2017-12-03 18:00   ` Drew Adams
  0 siblings, 1 reply; 21+ messages in thread
From: Noam Postavsky @ 2017-12-03 16:37 UTC (permalink / raw)
  To: Drew Adams; +Cc: 24914

[-- Attachment #1: Type: text/plain, Size: 500 bytes --]

Drew Adams <drew.adams@oracle.com> writes:

> 3. C-M-s \(.\|^J\)\{,40000\}
>
> That shows the error message: [incomplete input], which is wrong, IMO.

The reason it doesn't work is because the number of repitions is limited
to 32767 (#x7fff).  Obviously that should be documented in the manual.

As to the error message itself, there isn't really a way to distinguish
between incomplete and invalid input, so the only thing I can see to do
is to change the message to [incomplete or invalid input].


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: patch --]
[-- Type: text/x-diff, Size: 1697 bytes --]

From f7bb728281408170cfe79005b03d2b382a84cdbd Mon Sep 17 00:00:00 2001
From: Noam Postavsky <npostavs@gmail.com>
Date: Sat, 2 Dec 2017 19:01:54 -0500
Subject: [PATCH] Document limitation of regexp repetition (Bug#24914)

* doc/lispref/searching.texi (Regexp Backslash): Explain that \{m,n\}
may only use numbers up to 32767.
* lisp/isearch.el (isearch-search): Update error message to include
invalid input possibility.
---
 doc/lispref/searching.texi | 3 ++-
 lisp/isearch.el            | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index 755fa554bb..92b7e6d17e 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -639,7 +639,8 @@ Regexp Backslash
 is a more general postfix operator that specifies repetition with a
 minimum of @var{m} repeats and a maximum of @var{n} repeats.  If @var{m}
 is omitted, the minimum is 0; if @var{n} is omitted, there is no
-maximum.
+maximum.  For both forms, @var{m} and @var{n}, if specified, may be no
+larger than 32767.
 
 For example, @samp{c[ad]\@{1,2\@}r} matches the strings @samp{car},
 @samp{cdr}, @samp{caar}, @samp{cadr}, @samp{cdar}, and @samp{cddr}, and
diff --git a/lisp/isearch.el b/lisp/isearch.el
index 13fa97ea71..dfc5f9f3f7 100644
--- a/lisp/isearch.el
+++ b/lisp/isearch.el
@@ -2853,7 +2853,7 @@ isearch-search
       ((string-match
 	"\\`Premature \\|\\`Unmatched \\|\\`Invalid "
 	isearch-error)
-       (setq isearch-error "incomplete input"))
+       (setq isearch-error "incomplete or invalid input"))
       ((and (not isearch-regexp)
 	    (string-match "\\`Regular expression too big" isearch-error))
        (cond
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* bug#24914: 24.5; isearch-regexp: wrong error message
  2017-12-03 16:37 ` Noam Postavsky
@ 2017-12-03 18:00   ` Drew Adams
  2017-12-03 18:13     ` Noam Postavsky
  0 siblings, 1 reply; 21+ messages in thread
From: Drew Adams @ 2017-12-03 18:00 UTC (permalink / raw)
  To: Noam Postavsky; +Cc: 24914

> > 3. C-M-s \(.\|^J\)\{,40000\}
> > That shows the error message: [incomplete input],
> > which is wrong, IMO.
> 
> The reason it doesn't work is because the number of repitions
> is limited to 32767 (#x7fff).

Yet another case for adding bignums to Emacs Lisp?
I imagine someone will answer that there needs to be
a limit.

In that case, can we not use something larger?
Could we use the value of `most-positive-fixnum'?

> Obviously that should be documented in the manual.

Yes, please.

> As to the error message itself, there isn't really a way
> to distinguish between incomplete and invalid input,

We do that in some places in the code.  Some code
parses the regexp, and that code must know (or be
able to know) both that the regexp is not incomplete
and that the numeral given for the number of repetitions
is too large.

> so the only thing I can see to do
> is to change the message to [incomplete or invalid input].

I suppose that's better than [incomplete], but it doesn't
really help users very much.

Can we please do that but keep this bug open, hoping that
someone will someday provide a real fix?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* bug#24914: 24.5; isearch-regexp: wrong error message
  2017-12-03 18:00   ` Drew Adams
@ 2017-12-03 18:13     ` Noam Postavsky
  2017-12-03 18:56       ` Drew Adams
  0 siblings, 1 reply; 21+ messages in thread
From: Noam Postavsky @ 2017-12-03 18:13 UTC (permalink / raw)
  To: Drew Adams; +Cc: 24914

Drew Adams <drew.adams@oracle.com> writes:

>> > 3. C-M-s \(.\|^J\)\{,40000\}
>> > That shows the error message: [incomplete input],
>> > which is wrong, IMO.
>> 
>> The reason it doesn't work is because the number of repitions
>> is limited to 32767 (#x7fff).
>
> Yet another case for adding bignums to Emacs Lisp?
> I imagine someone will answer that there needs to be
> a limit.
>
> In that case, can we not use something larger?
> Could we use the value of `most-positive-fixnum'?

It's not a limit in Lisp, but in regex.c.

>> As to the error message itself, there isn't really a way
>> to distinguish between incomplete and invalid input,
>
> We do that in some places in the code.

What places are those?

> Some code parses the regexp, and that code must know (or be able to
> know) both that the regexp is not incomplete

What does it mean for a regexp to be incomplete or not?  As far as I can
tell, the only distinction is that the user means to type more; but the
code doesn't know what will happen in the future...

> and that the numeral
> given for the number of repetitions is too large.

I suppose we could change regex.c to give a different error message for
a repetition number that is too high, and then isearch.el could check
for that specially.





^ permalink raw reply	[flat|nested] 21+ messages in thread

* bug#24914: 24.5; isearch-regexp: wrong error message
  2017-12-03 18:13     ` Noam Postavsky
@ 2017-12-03 18:56       ` Drew Adams
  2017-12-04  6:27         ` Noam Postavsky
  0 siblings, 1 reply; 21+ messages in thread
From: Drew Adams @ 2017-12-03 18:56 UTC (permalink / raw)
  To: Noam Postavsky; +Cc: 24914

> >> > 3. C-M-s \(.\|^J\)\{,40000\}
> >> > That shows the error message: [incomplete input],
> >> > which is wrong, IMO.
> >>
> >> The reason it doesn't work is because the number of repitions
> >> is limited to 32767 (#x7fff).
> >
> > Yet another case for adding bignums to Emacs Lisp?
> > I imagine someone will answer that there needs to be
> > a limit.
> >
> > In that case, can we not use something larger?
> > Could we use the value of `most-positive-fixnum'?
> 
> It's not a limit in Lisp, but in regex.c.

We can't use something larger there?

> >> As to the error message itself, there isn't really a way
> >> to distinguish between incomplete and invalid input,
> >
> > We do that in some places in the code.
> 
> What places are those?

In the Lisp code, at least, there are a few places where
we provide an error that is specific to an invalid regexp.
Search for handling of standard error `invalid-regexp',
for instance.

But if this is handled only in C code then you might want
to look there instead.

> > Some code parses the regexp, and that code must know (or be able to
> > know) both that the regexp is not incomplete
> 
> What does it mean for a regexp to be incomplete or not?  As far as I can
> tell, the only distinction is that the user means to type more; but the
> code doesn't know what will happen in the future...

Presumably that term is used only for cases where we can
be sure that in order for the regexp to be valid there
would need to be further input.  `foo' is not incomplete,
whether or not the user "means to type more".  `[^' is
incomplete, because it can be made valid only by typing
more.

> > and that the numeral
> > given for the number of repetitions is too large.
> 
> I suppose we could change regex.c to give a different error message for
> a repetition number that is too high, and then isearch.el could check
> for that specially.

That would be great.





^ permalink raw reply	[flat|nested] 21+ messages in thread

* bug#24914: 24.5; isearch-regexp: wrong error message
  2017-12-03 18:56       ` Drew Adams
@ 2017-12-04  6:27         ` Noam Postavsky
  2017-12-04 14:52           ` Drew Adams
  2017-12-04 15:18           ` Eli Zaretskii
  0 siblings, 2 replies; 21+ messages in thread
From: Noam Postavsky @ 2017-12-04  6:27 UTC (permalink / raw)
  To: Drew Adams; +Cc: 24914

Drew Adams <drew.adams@oracle.com> writes:

>> It's not a limit in Lisp, but in regex.c.
>
> We can't use something larger there?

Hmm, right, actually I see in regex.h:

    /* If sizeof(int) == 2, then ((1 << 15) - 1) overflows.  */
    #define RE_DUP_MAX (0x7fff)

Does Emacs even support 16 bit platforms?

>> >> As to the error message itself, there isn't really a way
>> >> to distinguish between incomplete and invalid input,
>> >
>> > We do that in some places in the code.
>> 
>> What places are those?
>
> In the Lisp code, at least, there are a few places where
> we provide an error that is specific to an invalid regexp.
> Search for handling of standard error `invalid-regexp',
> for instance.

As far as I can tell, none of those places (apart from isearch.el, the
subject of this bug) try to flag "incomplete" regexps, only invalid or
valid.

>> > Some code parses the regexp, and that code must know (or be able to
>> > know) both that the regexp is not incomplete
>> 
>> What does it mean for a regexp to be incomplete or not?  As far as I can
>> tell, the only distinction is that the user means to type more; but the
>> code doesn't know what will happen in the future...
>
> Presumably that term is used only for cases where we can
> be sure that in order for the regexp to be valid there
> would need to be further input.  `foo' is not incomplete,
> whether or not the user "means to type more".  `[^' is
> incomplete, because it can be made valid only by typing
> more.

Is `\\{100,20\\}' incomplete?  Because it could be made valid by the user
adding a 0 after the 20 to become '\\{100,200\\}'.

Actually, I'm wondering what's the point of isearch showing "incomplete"
instead of the actual regexp invalid error.  I.e., why not instead of

    \ [incomplete]
    \{ [incomplete]
    \{4 [incomplete]
    \{4000 [incomplete]
    \{4000\ [incomplete]
    \{4000\}

show this:

    \ [Trailing backslash]
    \{ [Unmatched \{]
    \{4 [Unmatched \{]
    \{4000 [Unmatched \{]
    \{4000\ [Trailing backslash]
    \{4000\}





^ permalink raw reply	[flat|nested] 21+ messages in thread

* bug#24914: 24.5; isearch-regexp: wrong error message
  2017-12-04  6:27         ` Noam Postavsky
@ 2017-12-04 14:52           ` Drew Adams
  2017-12-05  1:18             ` Noam Postavsky
  2017-12-04 15:18           ` Eli Zaretskii
  1 sibling, 1 reply; 21+ messages in thread
From: Drew Adams @ 2017-12-04 14:52 UTC (permalink / raw)
  To: Noam Postavsky; +Cc: 24914

> >> >> As to the error message itself, there isn't really a way
> >> >> to distinguish between incomplete and invalid input,
> >> >
> >> > We do that in some places in the code.
> >>
> >> What places are those?
> >
> > In the Lisp code, at least, there are a few places where
> > we provide an error that is specific to an invalid regexp.
> > Search for handling of standard error `invalid-regexp',
> > for instance.
> 
> As far as I can tell, none of those places (apart from isearch.el, the
> subject of this bug) try to flag "incomplete" regexps, only invalid or
> valid.

Isn't that the point?  In the case in question the regexp is
not incomplete.  It is "invalid" because the occurrences
count is too high.  Showing a message that says it is
incomplete is wrong - that was the point of this report.

What I cited are cases where we do flag _particular kinds_
of invalid regexps, and so tailor the error msg.  That's
what could be hoped for in the current case too: ideally,
show a msg that says that the occurrences count is too high.
If that can't be detected exactly then perhaps we can get
close - e.g., invalid occurrences count or some such.

> >> > Some code parses the regexp, and that code must know
> >> > (or be able to know) both that the regexp is not incomplete
> >>
> >> What does it mean for a regexp to be incomplete or not?
> >> As far as I can tell, the only distinction is that the
> >> user means to type more; but the code doesn't know what
> >> will happen in the future...
> >
> > Presumably that term is used only for cases where we can
> > be sure that in order for the regexp to be valid there
> > would need to be further input.  `foo' is not incomplete,
> > whether or not the user "means to type more".  `[^' is
> > incomplete, because it can be made valid only by typing
> > more.
> 
> Is `\\{100,20\\}' incomplete?  Because it could be made valid
> by the user adding a 0 after the 20 to become '\\{100,200\\}'.

Of course, a user could always use `M-e' to edit the search
pattern and type 0 before the \\}.  But our isearch messages
don't take that kind of thing into account.  They assume the
cursor is at the _end_ of the search pattern, so that further
input is appended to the pattern.

An incomplete-regexp message means (so far, aside from bugs
like this one or perhaps cases where Emacs cannot do better)
that we expect you to keep typing - at the end of the search
pattern, to complete a valid regexp.

> Actually, I'm wondering what's the point of isearch showing
> "incomplete" instead of the actual regexp invalid error.
> I.e., why not instead of
> 
>     \ [incomplete]
>     \{ [incomplete]
>     \{4 [incomplete]
>     \{4000 [incomplete]
>     \{4000\ [incomplete]
>     \{4000\}
> 
> show this:
> 
>     \ [Trailing backslash]
>     \{ [Unmatched \{]
>     \{4 [Unmatched \{]
>     \{4000 [Unmatched \{]
>     \{4000\ [Trailing backslash]
>     \{4000\}

Feel free to work on that.  You might run into some cases
that are not so clear-cut.  But you might well improve
things generally in some way.

The problem with that is (I suppose) that it is not, in
general, straightforward what would be needed to make
the current pattern a valid regexp.

In particular, there might be multiple ways to make it
valid.  Trying to describe what you're expecting, as
possible appended input that would make for a valid
regexp, would be hard.

And doing it accurately, even when feasible, would lead
to complex error msgs.  It's maybe more user-friendly
to just indicate that, so far, the regexp is not valid,
but that it could become valid by appending something
(i.e., without trying to accurately characterize that
something).

Anyway, unless working on that is needed or appropriate
for fixing the reported bug, that should perhaps be
dealt with by a separate bug (enhancement request).

^ permalink raw reply	[flat|nested] 21+ messages in thread

* bug#24914: 24.5; isearch-regexp: wrong error message
  2017-12-04  6:27         ` Noam Postavsky
  2017-12-04 14:52           ` Drew Adams
@ 2017-12-04 15:18           ` Eli Zaretskii
  1 sibling, 0 replies; 21+ messages in thread
From: Eli Zaretskii @ 2017-12-04 15:18 UTC (permalink / raw)
  To: Noam Postavsky; +Cc: 24914

> From: Noam Postavsky <npostavs@users.sourceforge.net>
> Date: Mon, 04 Dec 2017 01:27:27 -0500
> Cc: 24914@debbugs.gnu.org
> 
> Drew Adams <drew.adams@oracle.com> writes:
> 
> >> It's not a limit in Lisp, but in regex.c.
> >
> > We can't use something larger there?
> 
> Hmm, right, actually I see in regex.h:
> 
>     /* If sizeof(int) == 2, then ((1 << 15) - 1) overflows.  */
>     #define RE_DUP_MAX (0x7fff)
> 
> Does Emacs even support 16 bit platforms?

Emacs never did (the MS-DOS port of Emacs runs in i386 32-bit
protected mode on top of a 16-bit OS).  But regex.c did, at some very
distant past, to support the 16-bit MS compiler, or at least it tried
to.





^ permalink raw reply	[flat|nested] 21+ messages in thread

* bug#24914: 24.5; isearch-regexp: wrong error message
  2017-12-04 14:52           ` Drew Adams
@ 2017-12-05  1:18             ` Noam Postavsky
  2017-12-05  3:15               ` Drew Adams
  2017-12-08  9:48               ` Eli Zaretskii
  0 siblings, 2 replies; 21+ messages in thread
From: Noam Postavsky @ 2017-12-05  1:18 UTC (permalink / raw)
  To: Drew Adams; +Cc: 24914

Drew Adams <drew.adams@oracle.com> writes:

> What I cited are cases where we do flag _particular kinds_
> of invalid regexps, and so tailor the error msg.

I'm not sure if you're citing actual code we have right now, or just
some hypotheticals.  In isearch.el, we pretty much do the opposite of
tailor the error message.

>> Actually, I'm wondering what's the point of isearch showing
>> "incomplete" instead of the actual regexp invalid error.
>> I.e., why not instead of
>> 
>>     \ [incomplete]
>>     \{ [incomplete]
>>     \{4 [incomplete]
>>     \{4000 [incomplete]
>>     \{4000\ [incomplete]
>>     \{4000\}
>> 
>> show this:
>> 
>>     \ [Trailing backslash]
>>     \{ [Unmatched \{]
>>     \{4 [Unmatched \{]
>>     \{4000 [Unmatched \{]
>>     \{4000\ [Trailing backslash]
>>     \{4000\}
>
> Feel free to work on that.  You might run into some cases
> that are not so clear-cut.  But you might well improve
> things generally in some way.

I meant just the following patch, you can try it out easily:

--- i/lisp/isearch.el
+++ w/lisp/isearch.el
@@ -2850,10 +2850,6 @@ isearch-search
     (invalid-regexp
      (setq isearch-error (car (cdr lossage)))
      (cond
-      ((string-match
-	"\\`Premature \\|\\`Unmatched \\|\\`Invalid "
-	isearch-error)
-       (setq isearch-error "incomplete input"))
       ((and (not isearch-regexp)
 	    (string-match "\\`Regular expression too big" isearch-error))
        (cond

Eli Zaretskii <eliz@gnu.org> writes:
>> 
>>     /* If sizeof(int) == 2, then ((1 << 15) - 1) overflows.  */
>>     #define RE_DUP_MAX (0x7fff)
>> 
>> Does Emacs even support 16 bit platforms?
>
> Emacs never did (the MS-DOS port of Emacs runs in i386 32-bit
> protected mode on top of a 16-bit OS).  But regex.c did, at some very
> distant past, to support the 16-bit MS compiler, or at least it tried
> to.

So changing to 2^31 as the max should be fine, right?

--- i/src/regex.h
+++ w/src/regex.h
@@ -270,8 +270,10 @@
 #ifdef RE_DUP_MAX
 # undef RE_DUP_MAX
 #endif
-/* If sizeof(int) == 2, then ((1 << 15) - 1) overflows.  */
-#define RE_DUP_MAX (0x7fff)
+/* If sizeof(int) == 4, then ((1 << 31) - 1) overflows.  This used to
+   be limited to 0x7fff, but Emacs never supported 16 bit platforms
+   anyway.  */
+#define RE_DUP_MAX (0x7fffffff)
 
 
 /* POSIX `cflags' bits (i.e., information for `regcomp').  */







^ permalink raw reply	[flat|nested] 21+ messages in thread

* bug#24914: 24.5; isearch-regexp: wrong error message
  2017-12-05  1:18             ` Noam Postavsky
@ 2017-12-05  3:15               ` Drew Adams
  2017-12-05  3:51                 ` Noam Postavsky
  2017-12-08  9:48               ` Eli Zaretskii
  1 sibling, 1 reply; 21+ messages in thread
From: Drew Adams @ 2017-12-05  3:15 UTC (permalink / raw)
  To: Noam Postavsky; +Cc: 24914

> > What I cited are cases where we do flag _particular kinds_
> > of invalid regexps, and so tailor the error msg.
> 
> I'm not sure if you're citing actual code we have right now, or just
> some hypotheticals.  In isearch.el, we pretty much do the opposite of
> tailor the error message.

I was citing what I thought were such cases in the current
isearch.el code - cases where we do not just say "Invalid
Regexp".  We say things like this:

 Too many words
 Too many spaces for whitespace matching
 Unmatched [ or [^

Granted, the last is used only in `isearch-query-replace.

My point was that in some existing cases (not many,
admittedly), we do try to give a more precise error
message when signal `invalid-regexp' is detected.

But I'm not sure what you're arguing, if you are arguing.

Certainly we don't tailor the message _much_ for the
kind of `invalid-regexp'.  But we do make some effort
to do that, even now, AFAICT.

> >> Actually, I'm wondering what's the point of isearch showing
> >> "incomplete" instead of the actual regexp invalid error.
> >> I.e., why not instead of
> >>
> >>     \ [incomplete]
> >>     \{ [incomplete]
> >>     \{4 [incomplete]
> >>     \{4000 [incomplete]
> >>     \{4000\ [incomplete]
> >>     \{4000\}
> >>
> >> show this:
> >>
> >>     \ [Trailing backslash]
> >>     \{ [Unmatched \{]
> >>     \{4 [Unmatched \{]
> >>     \{4000 [Unmatched \{]
> >>     \{4000\ [Trailing backslash]
> >>     \{4000\}
> 
> I meant just the following patch, you can try it out easily:
>      (invalid-regexp
>       (setq isearch-error (car (cdr lossage)))
>       (cond
> -      ((string-match
> -	"\\`Premature \\|\\`Unmatched \\|\\`Invalid "
> -	isearch-error)
> -       (setq isearch-error "incomplete input"))
>        ((and (not isearch-regexp)
>  	    (string-match "\\`Regular expression too big" isearch-error))
>         (cond

You mean show "[Invalid content of \{\}]" in all cases?
_Never_ show "[incomplete input]"?  Why would that be better?

Anyway, I don't have a strong opinion about that.  I do think
that in the case reported it's too bad that we say "[incomplete
input]".

But I don't think it follows that it would be more helpful to
most users to show the invalid-regexp description in cases
where Emacs can really tell that the input is necessarily
incomplete.  I suspect that it is quite common for that
"incomplete input" message to be helpful.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* bug#24914: 24.5; isearch-regexp: wrong error message
  2017-12-05  3:15               ` Drew Adams
@ 2017-12-05  3:51                 ` Noam Postavsky
  2017-12-05  4:52                   ` Drew Adams
  0 siblings, 1 reply; 21+ messages in thread
From: Noam Postavsky @ 2017-12-05  3:51 UTC (permalink / raw)
  To: Drew Adams; +Cc: 24914

Drew Adams <drew.adams@oracle.com> writes:

> I was citing what I thought were such cases in the current
> isearch.el code - cases where we do not just say "Invalid
> Regexp".  We say things like this:
>
>  Too many words
>  Too many spaces for whitespace matching

In that case, the regexp is constructed by Emacs on behalf of the user,
so it makes sense to translate the error message so that it matches the
user operation.

>  Unmatched [ or [^

AFAICT, the isearch.el code doesn't write that message, but rather reads
it.

> Granted, the last is used only in `isearch-query-replace.
>
> My point was that in some existing cases (not many,
> admittedly), we do try to give a more precise error
> message when signal `invalid-regexp' is detected.
>
> But I'm not sure what you're arguing, if you are arguing.

The C regex code produces the following error messages when parsing
regexps:

  {
    gettext_noop ("Success"),	/* REG_NOERROR */
    gettext_noop ("No match"),	/* REG_NOMATCH */
    gettext_noop ("Invalid regular expression"), /* REG_BADPAT */
    gettext_noop ("Invalid collation character"), /* REG_ECOLLATE */
    gettext_noop ("Invalid character class name"), /* REG_ECTYPE */
    gettext_noop ("Trailing backslash"), /* REG_EESCAPE */
    gettext_noop ("Invalid back reference"), /* REG_ESUBREG */
    gettext_noop ("Unmatched [ or [^"),	/* REG_EBRACK */
    gettext_noop ("Unmatched ( or \\("), /* REG_EPAREN */
    gettext_noop ("Unmatched \\{"), /* REG_EBRACE */
    gettext_noop ("Invalid content of \\{\\}"), /* REG_BADBR */
    gettext_noop ("Invalid range end"),	/* REG_ERANGE */
    gettext_noop ("Memory exhausted"), /* REG_ESPACE */
    gettext_noop ("Invalid preceding regular expression"), /* REG_BADRPT */
    gettext_noop ("Premature end of regular expression"), /* REG_EEND */
    gettext_noop ("Regular expression too big"), /* REG_ESIZE */
    gettext_noop ("Unmatched ) or \\)"), /* REG_ERPAREN */
    gettext_noop ("Range striding over charsets"), /* REG_ERANGEX  */
  };

When doing isearch-*-regexp, most of those errors are converted into
"incomplete" (i.e., *less* precise).  But I think it would be more
helpful to show the original error message.

> But I don't think it follows that it would be more helpful to
> most users to show the invalid-regexp description in cases
> where Emacs can really tell that the input is necessarily
> incomplete.  I suspect that it is quite common for that
> "incomplete input" message to be helpful.

How does it help (compared to the more precise message)?  Seems to me
that telling the user they haven't finished entering the regexp isn't
especially helpful; surely the user already knows they haven't finished
typing yet.





^ permalink raw reply	[flat|nested] 21+ messages in thread

* bug#24914: 24.5; isearch-regexp: wrong error message
  2017-12-05  3:51                 ` Noam Postavsky
@ 2017-12-05  4:52                   ` Drew Adams
  2017-12-05 13:27                     ` Noam Postavsky
  0 siblings, 1 reply; 21+ messages in thread
From: Drew Adams @ 2017-12-05  4:52 UTC (permalink / raw)
  To: Noam Postavsky; +Cc: 24914

>     gettext_noop ("Success"),	/* REG_NOERROR */
>     gettext_noop ("No match"),	/* REG_NOMATCH */
>     gettext_noop ("Invalid regular expression"), /* REG_BADPAT */
>     gettext_noop ("Invalid collation character"), /* REG_ECOLLATE */
>     gettext_noop ("Invalid character class name"), /* REG_ECTYPE */
>     gettext_noop ("Trailing backslash"), /* REG_EESCAPE */
>     gettext_noop ("Invalid back reference"), /* REG_ESUBREG */
>     gettext_noop ("Unmatched [ or [^"),	/* REG_EBRACK */
>     gettext_noop ("Unmatched ( or \\("), /* REG_EPAREN */
>     gettext_noop ("Unmatched \\{"), /* REG_EBRACE */
>     gettext_noop ("Invalid content of \\{\\}"), /* REG_BADBR */
>     gettext_noop ("Invalid range end"),	/* REG_ERANGE */
>     gettext_noop ("Memory exhausted"), /* REG_ESPACE */
>     gettext_noop ("Invalid preceding regular expression"), /* REG_BADRPT
> */
>     gettext_noop ("Premature end of regular expression"), /* REG_EEND */
>     gettext_noop ("Regular expression too big"), /* REG_ESIZE */
>     gettext_noop ("Unmatched ) or \\)"), /* REG_ERPAREN */
>     gettext_noop ("Range striding over charsets"), /* REG_ERANGEX  */
> 
> When doing isearch-*-regexp, most of those errors are converted into
> "incomplete" (i.e., *less* precise).  But I think it would be more
> helpful to show the original error message.

Agreed.  Unless Emacs can be sure that in some context one of
them really means that the regexp is not complete.  If the
user can just keep typing to make a valid regexp then an error
msg (premature, not yet warranted) typically hurts more than it
helps, I think.

But if Emacs can't tell that, than sure, why not?

Timing can mean a lot also (but depends on the user and
how much thinking is going on).  It's not great to interrupt
immediately with an error msg if the user was still typing.

And clearly some of those error conditions do _not_ end
up translated as "incomplete input" messages - or they
should not, in any case.

Clearly someone made a decision about "Trailing backslash",
for instance, and it's a very good decision IMO.  That's a
more useful "the-pattern-is-incomplete" message than just
saying "incomplete input".

We are not the first to consider the list of error
conditions and what to do about this one or that one.
That doesn't imply that the judgment made previously is
the best one. It does suggest perhaps consulting those
who might have made it, or the larger emacs-devel community.

The behavior could be completely one-sided one way or
the other way.  Or it could be any kind of mix in between.
It is currently a mix, but obviously not a perfect one -
hence this bug report.  Which tradeoff is best?

> > But I don't think it follows that it would be more helpful to
> > most users to show the invalid-regexp description in cases
> > where Emacs can really tell that the input is necessarily
> > incomplete.  I suspect that it is quite common for that
> > "incomplete input" message to be helpful.
> 
> How does it help (compared to the more precise message)?

See above.  Isearch is incremental: you don't just enter
a complete regexp once and for all (as in `grep', for
instance.  If Emacs jumps the gun with a premature "error"
msg, that can be annoying, no?

> Seems to me
> that telling the user they haven't finished entering the regexp isn't
> especially helpful; surely the user already knows they haven't finished
> typing yet.

No, _not_ surely - I think.  Many (most? maybe, maybe not)
users are not all that positive about using Emacs regexps.

They should be able to interact with Isearch regexps
interactively, incrementally.  And yes, I think that it
can be helpful to let a user know that the current pattern
is incomplete as a regexp.

But hey, users are different.  I'd argue that we could
add an option, with the default setting the current
behavior (as I expect it is those less experienced that
the "incomplete" behavior could benefit the most, and
those more experience who could benefit most from the
specific "invalid" messages).  The latter are probably
also the ones most likely to try more complicated
regexps, which benefit the most, I expect, from precise
"invalid" messages.

Adding the behavior you propose as an option shouldn't
hurt.  But even for that you might want to propose it
at emacs-devel.  There might be people there more
familiar with different use cases or who know more about
the history of why the current behavior is as it is.

Just a suggestion.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* bug#24914: 24.5; isearch-regexp: wrong error message
  2017-12-05  4:52                   ` Drew Adams
@ 2017-12-05 13:27                     ` Noam Postavsky
  2017-12-05 15:31                       ` Drew Adams
  0 siblings, 1 reply; 21+ messages in thread
From: Noam Postavsky @ 2017-12-05 13:27 UTC (permalink / raw)
  To: Drew Adams; +Cc: 24914

Drew Adams <drew.adams@oracle.com> writes:

> Clearly someone made a decision about "Trailing backslash",
> for instance, and it's a very good decision IMO.  That's a
> more useful "the-pattern-is-incomplete" message than just
> saying "incomplete input".

Right, but I can't see why the same shouldn't apply to "Unmatched \\{"
and all the others.

> We are not the first to consider the list of error
> conditions and what to do about this one or that one.
> That doesn't imply that the judgment made previously is
> the best one. It does suggest perhaps consulting those
> who might have made it, or the larger emacs-devel community.

That code seems to have been there since 1992 "Initial revision", so
it's not clear what, if any, considerations were made.

> See above.  Isearch is incremental: you don't just enter
> a complete regexp once and for all (as in `grep', for
> instance.  If Emacs jumps the gun with a premature "error"
> msg, that can be annoying, no?

We already get an "error" message, it says "incomplete".  The question
is why shouldn't it instead say *why* it's incomplete.

> Adding the behavior you propose as an option shouldn't
> hurt.

It hurts, because it adds yet another option, which makes a user's job
of going through them and deciding which ones make sense that much
harder (yes, just this particular addded option won't make much
difference, but still).

>There might be people there more familiar with different use cases or
>who know more about the history of why the current behavior is as it
>is.

I hope so.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* bug#24914: 24.5; isearch-regexp: wrong error message
  2017-12-05 13:27                     ` Noam Postavsky
@ 2017-12-05 15:31                       ` Drew Adams
  2017-12-06  2:52                         ` Noam Postavsky
  0 siblings, 1 reply; 21+ messages in thread
From: Drew Adams @ 2017-12-05 15:31 UTC (permalink / raw)
  To: Noam Postavsky; +Cc: 24914

> > Clearly someone made a decision about "Trailing backslash",
> > for instance, and it's a very good decision IMO.  That's a
> > more useful "the-pattern-is-incomplete" message than just
> > saying "incomplete input".
> 
> Right, but I can't see why the same shouldn't apply to
> "Unmatched \\{" and all the others.

Treating them all the same is one possibility.
Not the best one, I think, but one possibility.

> > We are not the first to consider the list of error
> > conditions and what to do about this one or that one.
> > That doesn't imply that the judgment made previously is
> > the best one. It does suggest perhaps consulting those
> > who might have made it, or the larger emacs-devel community.
> 
> That code seems to have been there since 1992 "Initial
> revision", so it's not clear what, if any, considerations
> were made.

It might not be clear, but that doesn't mean there weren't
good reasons for that judgment.  (And no, I'm not saying
that a different judgment can't be made now.)

And certainly some of those around then, including
deciders probably, are still around today.  Perhaps
RMS has an opinion or recollection about this?

> > See above.  Isearch is incremental: you don't just enter
> > a complete regexp once and for all (as in `grep', for
> > instance.  If Emacs jumps the gun with a premature "error"
> > msg, that can be annoying, no?
> 
> We already get an "error" message, it says "incomplete".
> The question is why shouldn't it instead say *why* it's
> incomplete.

I thought your proposal was to, in all cases, eliminate
saying it is incomplete in favor of the specific
regexp-invalidity error text.

Such error text does not, generally and directly, tell
users that the input is incomplete.  Users very familiar
with regexps might understand that such a msg implies
that input is incomplete, but not everyone will get that.

> > Adding the behavior you propose as an option shouldn't
> > hurt.
> 
> It hurts, because it adds yet another option, which makes a user's job
> of going through them and deciding which ones make sense that much
> harder (yes, just this particular addded option won't make much
> difference, but still).

Users who are very familiar with Emacs regexps will be
the ones to benefit most from the specific regexp-validity
msgs, IMO.  They should have no problem customizing an
option.  Users unfamiliar with Emacs or Emacs regexps will
get the simpler default behavior: your input pattern is
not yet complete.

If you feel strongly about this and are opposed to
adding a user option, consider proposing the change
you want to emacs-devel.

This particular bug is about one case: just the
particular "incomplete input" message case cited.
Fixing this bug shouldn't require changing the
basic behavior, though that is certainly one
possibility.

You apparently think there is never any value in
telling users that the input pattern is not
complete as a regexp.  I disagree.  We apparently
agree that at least in some cases the specific
regexp-invalidity message is more helpful.

There is a spectrum of possibilities here.  I see
no special reason why the right approach should
be all or none.

> > There might be people there more familiar with different
> > use cases or who know more about the history of why the
> > current behavior is as it is.
> 
> I hope so.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* bug#24914: 24.5; isearch-regexp: wrong error message
  2017-12-05 15:31                       ` Drew Adams
@ 2017-12-06  2:52                         ` Noam Postavsky
  0 siblings, 0 replies; 21+ messages in thread
From: Noam Postavsky @ 2017-12-06  2:52 UTC (permalink / raw)
  To: Drew Adams; +Cc: 24914

[-- Attachment #1: Type: text/plain, Size: 857 bytes --]

tags 24914 + patch
quit

Drew Adams <drew.adams@oracle.com> writes:

> Such error text does not, generally and directly, tell
> users that the input is incomplete.  Users very familiar
> with regexps might understand that such a msg implies
> that input is incomplete, but not everyone will get that.

Hmm, I hadn't considered that possibility, but I will allow that *could*
be a symptom of my being overly familiar with regexp syntax.

> You apparently think there is never any value in
> telling users that the input pattern is not
> complete as a regexp.  I disagree.  We apparently
> agree that at least in some cases the specific
> regexp-invalidity message is more helpful.

Okay, I've looked at the error messages a bit more closely, and I
believe all the "Invalid ..." ones should never be considered
"incomplete".  See commit message for details.


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: patch --]
[-- Type: text/x-diff, Size: 5656 bytes --]

From 1d32f4d28521a143c333ef4cc125419661e3a3a9 Mon Sep 17 00:00:00 2001
From: Noam Postavsky <npostavs@gmail.com>
Date: Sat, 2 Dec 2017 19:01:54 -0500
Subject: [PATCH] Raise limit of regexp repetition (Bug#24914)

* src/regex.h (RE_DUP_MAX): Raise limit to 2^32-1.
* etc/NEWS: Announce it.
* doc/lispref/searching.texi (Regexp Backslash): Document it.

* src/regex.h (reg_errcode_t): Add REG_ESIZEBR code.
* src/regex.c (re_error_msgid): Add corresponding entry.
(GET_INTERVAL_COUNT): Return it instead of the more generic REG_EBADBR
when encountering a repetition greater than RE_DUP_MAX.

* lisp/isearch.el (isearch-search): Don't convert errors starting with
"Invalid" into "incomplete".  Such errors are not incomplete, in the
sense that they cannot be corrected by appending more characters to
the end of the regexp.  The affected error messages are:

- REG_BADPAT "Invalid regular expression"
  - \\(?X:\\) where X is not a legal group number
  - \\_X where X is not < or >

- REG_ECOLLATE "Invalid collation character"
  - There is no code to throw this.

- REG_ECTYPE "Invalid character class name"
  - [[:foo:] where foo is not a valid class name

- REG_ESUBREG "Invalid back reference"
  - \N where N is referenced before matching group N

- REG_BADBR "Invalid content of \\{\\}"
  - \\{N,M\\} where N < 0, M < N, M or N larger than max
  - \\{NX where X is not a digit or backslash
  - \\{N\\X where X is not a }

- REG_ERANGE "Invalid range end"
  - There is no code to throw this.

- REG_BADRPT "Invalid preceding regular expression"
  - We never throw this.  It would usually indicate a "*" with no
    preceding regexp text, but Emacs allows that to match a literal
    "*".
---
 doc/lispref/searching.texi | 9 ++++++++-
 etc/NEWS                   | 8 ++++++++
 lisp/isearch.el            | 2 +-
 src/regex.c                | 5 +++--
 src/regex.h                | 9 ++++++---
 5 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index 755fa554bb..724d66b5e3 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -639,7 +639,14 @@ Regexp Backslash
 is a more general postfix operator that specifies repetition with a
 minimum of @var{m} repeats and a maximum of @var{n} repeats.  If @var{m}
 is omitted, the minimum is 0; if @var{n} is omitted, there is no
-maximum.
+maximum.  For both forms, @var{m} and @var{n}, if specified, may be no
+larger than
+@ifnottex
+2**31 @minus{} 1
+@end ifnottex
+@tex
+@math{2^{31}-1}
+@end tex
 
 For example, @samp{c[ad]\@{1,2\@}r} matches the strings @samp{car},
 @samp{cdr}, @samp{caar}, @samp{cadr}, @samp{cdar}, and @samp{cddr}, and
diff --git a/etc/NEWS b/etc/NEWS
index 4ccf468693..579cad058e 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -509,6 +509,14 @@ instead.
 ** The new user option 'arabic-shaper-ZWNJ-handling' controls how to
 handle ZWNJ in Arabic text rendering.
 
++++
+** The limit on repetitions in regexps has been raised to 2^31-1.
+It was previously undocumented and limited to 2^15-1.  For example,
+the following regular expression was previously invalid, but is now
+accepted:
+
+   x\{32768\}
+
 \f
 * Editing Changes in Emacs 26.1
 
diff --git a/lisp/isearch.el b/lisp/isearch.el
index 13fa97ea71..093185a096 100644
--- a/lisp/isearch.el
+++ b/lisp/isearch.el
@@ -2851,7 +2851,7 @@ isearch-search
      (setq isearch-error (car (cdr lossage)))
      (cond
       ((string-match
-	"\\`Premature \\|\\`Unmatched \\|\\`Invalid "
+	"\\`Premature \\|\\`Unmatched "
 	isearch-error)
        (setq isearch-error "incomplete input"))
       ((and (not isearch-regexp)
diff --git a/src/regex.c b/src/regex.c
index 330f2f78a8..ab74f457d4 100644
--- a/src/regex.c
+++ b/src/regex.c
@@ -1200,7 +1200,8 @@ WEAK_ALIAS (__re_set_syntax, re_set_syntax)
     gettext_noop ("Premature end of regular expression"), /* REG_EEND */
     gettext_noop ("Regular expression too big"), /* REG_ESIZE */
     gettext_noop ("Unmatched ) or \\)"), /* REG_ERPAREN */
-    gettext_noop ("Range striding over charsets") /* REG_ERANGEX  */
+    gettext_noop ("Range striding over charsets"), /* REG_ERANGEX  */
+    gettext_noop ("Invalid content of \\{\\}, repetitions too big") /* REG_ESIZEBR  */
   };
 \f
 /* Whether to allocate memory during matching.  */
@@ -1921,7 +1922,7 @@ while (REMAINING_AVAIL_SLOTS <= space) {				\
 	    if (num < 0)						\
 	      num = 0;							\
 	    if (RE_DUP_MAX / 10 - (RE_DUP_MAX % 10 < c - '0') < num)	\
-	      FREE_STACK_RETURN (REG_BADBR);				\
+	      FREE_STACK_RETURN (REG_ESIZEBR);				\
 	    num = num * 10 + c - '0';					\
 	    if (p == pend)						\
 	      FREE_STACK_RETURN (REG_EBRACE);				\
diff --git a/src/regex.h b/src/regex.h
index 9fa8356011..b829848586 100644
--- a/src/regex.h
+++ b/src/regex.h
@@ -270,8 +270,10 @@
 #ifdef RE_DUP_MAX
 # undef RE_DUP_MAX
 #endif
-/* If sizeof(int) == 2, then ((1 << 15) - 1) overflows.  */
-#define RE_DUP_MAX (0x7fff)
+/* If sizeof(int) == 4, then ((1 << 31) - 1) overflows.  This used to
+   be limited to 0x7fff, but Emacs never supported 16 bit platforms
+   anyway.  */
+#define RE_DUP_MAX (0x7fffffff)
 
 
 /* POSIX `cflags' bits (i.e., information for `regcomp').  */
@@ -337,7 +339,8 @@
   REG_EEND,		/* Premature end.  */
   REG_ESIZE,		/* Compiled pattern bigger than 2^16 bytes.  */
   REG_ERPAREN,		/* Unmatched ) or \); not returned from regcomp.  */
-  REG_ERANGEX		/* Range striding over charsets.  */
+  REG_ERANGEX,		/* Range striding over charsets.  */
+  REG_ESIZEBR           /* n or m too big in \{n,m\} */
 } reg_errcode_t;
 \f
 /* This data structure represents a compiled pattern.  Before calling
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* bug#24914: 24.5; isearch-regexp: wrong error message
  2017-12-05  1:18             ` Noam Postavsky
  2017-12-05  3:15               ` Drew Adams
@ 2017-12-08  9:48               ` Eli Zaretskii
  2017-12-08 13:32                 ` Noam Postavsky
  1 sibling, 1 reply; 21+ messages in thread
From: Eli Zaretskii @ 2017-12-08  9:48 UTC (permalink / raw)
  To: Noam Postavsky; +Cc: 24914

> From: Noam Postavsky <npostavs@users.sourceforge.net>
> Date: Mon, 04 Dec 2017 20:18:02 -0500
> Cc: 24914@debbugs.gnu.org
> 
> > Emacs never did (the MS-DOS port of Emacs runs in i386 32-bit
> > protected mode on top of a 16-bit OS).  But regex.c did, at some very
> > distant past, to support the 16-bit MS compiler, or at least it tried
> > to.
> 
> So changing to 2^31 as the max should be fine, right?

Or maybe just INT_MAX?





^ permalink raw reply	[flat|nested] 21+ messages in thread

* bug#24914: 24.5; isearch-regexp: wrong error message
  2017-12-08  9:48               ` Eli Zaretskii
@ 2017-12-08 13:32                 ` Noam Postavsky
  2017-12-08 14:35                   ` Eli Zaretskii
  0 siblings, 1 reply; 21+ messages in thread
From: Noam Postavsky @ 2017-12-08 13:32 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 24914

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Noam Postavsky <npostavs@users.sourceforge.net>
>> Date: Mon, 04 Dec 2017 20:18:02 -0500
>> Cc: 24914@debbugs.gnu.org
>> 
>> > Emacs never did (the MS-DOS port of Emacs runs in i386 32-bit
>> > protected mode on top of a 16-bit OS).  But regex.c did, at some very
>> > distant past, to support the 16-bit MS compiler, or at least it tried
>> > to.
>> 
>> So changing to 2^31 as the max should be fine, right?
>
> Or maybe just INT_MAX?

I thought it would be easier to document the limit if it's fixed across
all machines.  Otherwise we would have to say something like "For both
forms, m and n, if specified, may be no larger than INT_MAX, which is
usually 2**31 - 1, but could be 2**63 - 1 depending on the compiler used
for building Emacs".





^ permalink raw reply	[flat|nested] 21+ messages in thread

* bug#24914: 24.5; isearch-regexp: wrong error message
  2017-12-08 13:32                 ` Noam Postavsky
@ 2017-12-08 14:35                   ` Eli Zaretskii
  2017-12-10  2:18                     ` Noam Postavsky
  0 siblings, 1 reply; 21+ messages in thread
From: Eli Zaretskii @ 2017-12-08 14:35 UTC (permalink / raw)
  To: Noam Postavsky; +Cc: 24914

> From: Noam Postavsky <npostavs@users.sourceforge.net>
> Cc: drew.adams@oracle.com,  24914@debbugs.gnu.org
> Date: Fri, 08 Dec 2017 08:32:42 -0500
> 
> >> So changing to 2^31 as the max should be fine, right?
> >
> > Or maybe just INT_MAX?
> 
> I thought it would be easier to document the limit if it's fixed across
> all machines.  Otherwise we would have to say something like "For both
> forms, m and n, if specified, may be no larger than INT_MAX, which is
> usually 2**31 - 1, but could be 2**63 - 1 depending on the compiler used
> for building Emacs".

Isn't int 32 bit wide everywhere?

And anyway, since the bitmap is stored in an int, isn't INT_MAX TRT?

But I'm not a language expert enough to argue; feel free to use your
form if you think it's better.





^ permalink raw reply	[flat|nested] 21+ messages in thread

* bug#24914: 24.5; isearch-regexp: wrong error message
  2017-12-08 14:35                   ` Eli Zaretskii
@ 2017-12-10  2:18                     ` Noam Postavsky
  2017-12-10  6:49                       ` Eli Zaretskii
  0 siblings, 1 reply; 21+ messages in thread
From: Noam Postavsky @ 2017-12-10  2:18 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 24914

[-- Attachment #1: Type: text/plain, Size: 2773 bytes --]

Eli Zaretskii <eliz@gnu.org> writes:

>> I thought it would be easier to document the limit if it's fixed across
>> all machines.  Otherwise we would have to say something like "For both
>> forms, m and n, if specified, may be no larger than INT_MAX, which is
>> usually 2**31 - 1, but could be 2**63 - 1 depending on the compiler used
>> for building Emacs".
>
> Isn't int 32 bit wide everywhere?

I might have been mixing up int with long when I was thinking about
this; it seems only a few very obscure platforms have 64 bit ints.
According to [1], everywhere but "HAL Computer Systems port of Solaris
to the SPARC64" and "Classic UNICOS" has 32 bit ints.

[1]: https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_models

> And anyway, since the bitmap is stored in an int, isn't INT_MAX TRT?

Unfortunately, all this discussion of int size seems to be academic.  I
took another look at the code, there is another limit due to regexp
opcode format.  We can raise the limit to 2^16-1 though.

Here is the use of RE_DUP_MAX, which makes it seem like int-size is the
main limit:

    /* Get the next unsigned number in the uncompiled pattern.  */
    #define GET_INTERVAL_COUNT(num)					\
      ...
                if (RE_DUP_MAX / 10 - (RE_DUP_MAX % 10 < c - '0') < num)	\
                  FREE_STACK_RETURN (REG_ESIZEBR);				\


    static reg_errcode_t
    regex_compile (const_re_char *pattern, size_t size,
    {
      ...
		int lower_bound = 0, upper_bound = -1;
                [...]
		GET_INTERVAL_COUNT (lower_bound);

But then

			INSERT_JUMP2 (succeed_n, laststart,
				      b + 5 + nbytes,
				      lower_bound);

    /* Like `STORE_JUMP2', but for inserting.  Assume `b' is the buffer end.  */
    #define INSERT_JUMP2(op, loc, to, arg) \
      insert_op2 (op, loc, (to) - (loc) - 3, arg, b)

    /* Like `insert_op1', but for two two-byte parameters ARG1 and ARG2.  */
                                      ^^^^^^^^
    static void
    insert_op2 (re_opcode_t op, unsigned char *loc, int arg1, int arg2, unsigned char *end)
    {
      ...
      store_op2 (op, loc, arg1, arg2);
    }

    /* Like `store_op1', but for two two-byte parameters ARG1 and ARG2.  */
                                     ^^^^^^^^
    static void
    store_op2 (re_opcode_t op, unsigned char *loc, int arg1, int arg2)
    {
      *loc = (unsigned char) op;
      STORE_NUMBER (loc + 1, arg1);
      STORE_NUMBER (loc + 3, arg2);
    }

    /* Store NUMBER in two contiguous bytes starting at DESTINATION.  */
                       ^^^^^^^^^^^^^^^^^^^^

    #define STORE_NUMBER(destination, number)				\
      do {									\
        (destination)[0] = (number) & 0377;					\
        (destination)[1] = (number) >> 8;					\
      } while (0)


Here is the updated patch:


[-- Attachment #2: patch --]
[-- Type: text/plain, Size: 6410 bytes --]

From 6c3ead6bd5c61801915dcedbb8dd17622610a899 Mon Sep 17 00:00:00 2001
From: Noam Postavsky <npostavs@gmail.com>
Date: Sat, 2 Dec 2017 19:01:54 -0500
Subject: [PATCH] Raise limit of regexp repetition (Bug#24914)

* src/regex.h (RE_DUP_MAX): Raise limit to 2^16-1.
* etc/NEWS: Announce it.
* doc/lispref/searching.texi (Regexp Backslash): Document it.
* test/src/regex-tests.el (regex-repeat-limit): Test it.

* src/regex.h (reg_errcode_t): Add REG_ESIZEBR code.
* src/regex.c (re_error_msgid): Add corresponding entry.
(GET_INTERVAL_COUNT): Return it instead of the more generic REG_EBADBR
when encountering a repetition greater than RE_DUP_MAX.

* lisp/isearch.el (isearch-search): Don't convert errors starting with
"Invalid" into "incomplete".  Such errors are not incomplete, in the
sense that they cannot be corrected by appending more characters to
the end of the regexp.  The affected error messages are:

- REG_BADPAT "Invalid regular expression"
  - \\(?X:\\) where X is not a legal group number
  - \\_X where X is not < or >

- REG_ECOLLATE "Invalid collation character"
  - There is no code to throw this.

- REG_ECTYPE "Invalid character class name"
  - [[:foo:] where foo is not a valid class name

- REG_ESUBREG "Invalid back reference"
  - \N where N is referenced before matching group N

- REG_BADBR "Invalid content of \\{\\}"
  - \\{N,M\\} where N < 0, M < N, M or N larger than max
  - \\{NX where X is not a digit or backslash
  - \\{N\\X where X is not a }

- REG_ERANGE "Invalid range end"
  - There is no code to throw this.

- REG_BADRPT "Invalid preceding regular expression"
  - We never throw this.  It would usually indicate a "*" with no
    preceding regexp text, but Emacs allows that to match a literal
    "*".
---
 doc/lispref/searching.texi | 10 +++++++++-
 etc/NEWS                   |  8 ++++++++
 lisp/isearch.el            |  2 +-
 src/regex.c                |  5 +++--
 src/regex.h                |  9 ++++++---
 test/src/regex-tests.el    |  6 ++++++
 6 files changed, 33 insertions(+), 7 deletions(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index 755fa554bb..ab52cf2802 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -639,7 +639,15 @@ Regexp Backslash
 is a more general postfix operator that specifies repetition with a
 minimum of @var{m} repeats and a maximum of @var{n} repeats.  If @var{m}
 is omitted, the minimum is 0; if @var{n} is omitted, there is no
-maximum.
+maximum.  For both forms, @var{m} and @var{n}, if specified, may be no
+larger than
+@ifnottex
+2**16 @minus{} 1
+@end ifnottex
+@tex
+@math{2^{16}-1}
+@end tex
+.
 
 For example, @samp{c[ad]\@{1,2\@}r} matches the strings @samp{car},
 @samp{cdr}, @samp{caar}, @samp{cadr}, @samp{cdar}, and @samp{cddr}, and
diff --git a/etc/NEWS b/etc/NEWS
index 64b53d88c8..c7efc53f6a 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -509,6 +509,14 @@ instead.
 ** The new user option 'arabic-shaper-ZWNJ-handling' controls how to
 handle ZWNJ in Arabic text rendering.
 
++++
+** The limit on repetitions in regexps has been raised to 2^16-1.
+It was previously undocumented and limited to 2^15-1.  For example,
+the following regular expression was previously invalid, but is now
+accepted:
+
+   x\{32768\}
+
 \f
 * Editing Changes in Emacs 26.1
 
diff --git a/lisp/isearch.el b/lisp/isearch.el
index 13fa97ea71..093185a096 100644
--- a/lisp/isearch.el
+++ b/lisp/isearch.el
@@ -2851,7 +2851,7 @@ isearch-search
      (setq isearch-error (car (cdr lossage)))
      (cond
       ((string-match
-	"\\`Premature \\|\\`Unmatched \\|\\`Invalid "
+	"\\`Premature \\|\\`Unmatched "
 	isearch-error)
        (setq isearch-error "incomplete input"))
       ((and (not isearch-regexp)
diff --git a/src/regex.c b/src/regex.c
index 330f2f78a8..ab74f457d4 100644
--- a/src/regex.c
+++ b/src/regex.c
@@ -1200,7 +1200,8 @@ WEAK_ALIAS (__re_set_syntax, re_set_syntax)
     gettext_noop ("Premature end of regular expression"), /* REG_EEND */
     gettext_noop ("Regular expression too big"), /* REG_ESIZE */
     gettext_noop ("Unmatched ) or \\)"), /* REG_ERPAREN */
-    gettext_noop ("Range striding over charsets") /* REG_ERANGEX  */
+    gettext_noop ("Range striding over charsets"), /* REG_ERANGEX  */
+    gettext_noop ("Invalid content of \\{\\}, repetitions too big") /* REG_ESIZEBR  */
   };
 \f
 /* Whether to allocate memory during matching.  */
@@ -1921,7 +1922,7 @@ while (REMAINING_AVAIL_SLOTS <= space) {				\
 	    if (num < 0)						\
 	      num = 0;							\
 	    if (RE_DUP_MAX / 10 - (RE_DUP_MAX % 10 < c - '0') < num)	\
-	      FREE_STACK_RETURN (REG_BADBR);				\
+	      FREE_STACK_RETURN (REG_ESIZEBR);				\
 	    num = num * 10 + c - '0';					\
 	    if (p == pend)						\
 	      FREE_STACK_RETURN (REG_EBRACE);				\
diff --git a/src/regex.h b/src/regex.h
index 9fa8356011..4c8632d6aa 100644
--- a/src/regex.h
+++ b/src/regex.h
@@ -270,8 +270,10 @@
 #ifdef RE_DUP_MAX
 # undef RE_DUP_MAX
 #endif
-/* If sizeof(int) == 2, then ((1 << 15) - 1) overflows.  */
-#define RE_DUP_MAX (0x7fff)
+/* Repeat counts are stored in opcodes as 2 byte integers.  This was
+   previously limited to 7fff because the parsing code uses signed
+   ints.  But Emacs only runs on 32 bit platforms anyway.  */
+#define RE_DUP_MAX (0xffff)
 
 
 /* POSIX `cflags' bits (i.e., information for `regcomp').  */
@@ -337,7 +339,8 @@
   REG_EEND,		/* Premature end.  */
   REG_ESIZE,		/* Compiled pattern bigger than 2^16 bytes.  */
   REG_ERPAREN,		/* Unmatched ) or \); not returned from regcomp.  */
-  REG_ERANGEX		/* Range striding over charsets.  */
+  REG_ERANGEX,		/* Range striding over charsets.  */
+  REG_ESIZEBR           /* n or m too big in \{n,m\} */
 } reg_errcode_t;
 \f
 /* This data structure represents a compiled pattern.  Before calling
diff --git a/test/src/regex-tests.el b/test/src/regex-tests.el
index b1f1ea71ce..872d16a085 100644
--- a/test/src/regex-tests.el
+++ b/test/src/regex-tests.el
@@ -677,4 +677,10 @@ regex-tests-TESTS
 This evaluates the TESTS test cases from glibc."
   (should-not (regex-tests-TESTS)))
 
+(ert-deftest regex-repeat-limit ()
+  "Test the #xFFFF repeat limit."
+  (should (string-match "\\`x\\{65535\\}" (make-string 65535 ?x)))
+  (should-not (string-match "\\`x\\{65535\\}" (make-string 65534 ?x)))
+  (should-error (string-match "\\`x\\{65536\\}" "X") :type invalid-regexp))
+
 ;;; regex-tests.el ends here
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* bug#24914: 24.5; isearch-regexp: wrong error message
  2017-12-10  2:18                     ` Noam Postavsky
@ 2017-12-10  6:49                       ` Eli Zaretskii
  2018-01-27  2:05                         ` Noam Postavsky
  0 siblings, 1 reply; 21+ messages in thread
From: Eli Zaretskii @ 2017-12-10  6:49 UTC (permalink / raw)
  To: Noam Postavsky; +Cc: 24914

> From: Noam Postavsky <npostavs@users.sourceforge.net>
> Cc: drew.adams@oracle.com,  24914@debbugs.gnu.org
> Date: Sat, 09 Dec 2017 21:18:05 -0500
> 
> Unfortunately, all this discussion of int size seems to be academic.  I
> took another look at the code, there is another limit due to regexp
> opcode format.  We can raise the limit to 2^16-1 though.
> [...]
> Here is the updated patch:

LGTM for master.  Thanks for the research and for the patch.





^ permalink raw reply	[flat|nested] 21+ messages in thread

* bug#24914: 24.5; isearch-regexp: wrong error message
  2017-12-10  6:49                       ` Eli Zaretskii
@ 2018-01-27  2:05                         ` Noam Postavsky
  0 siblings, 0 replies; 21+ messages in thread
From: Noam Postavsky @ 2018-01-27  2:05 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 24914

tags 24914 fixed
close 24914 27.1
quit

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Noam Postavsky <npostavs@users.sourceforge.net>
>> Cc: drew.adams@oracle.com,  24914@debbugs.gnu.org
>> Date: Sat, 09 Dec 2017 21:18:05 -0500
>> 
>> Unfortunately, all this discussion of int size seems to be academic.  I
>> took another look at the code, there is another limit due to regexp
>> opcode format.  We can raise the limit to 2^16-1 though.
>> [...]
>> Here is the updated patch:
>
> LGTM for master.  Thanks for the research and for the patch.

Pushed to master [1: 559f160616], also documented limit in emacs-26 [2: 463f96b481].

[1: 559f160616]: 2018-01-26 20:49:44 -0500
  Raise limit of regexp repetition (Bug#24914)
  https://git.savannah.gnu.org/cgit/emacs.git/commit/?id=559f1606166822394df3988c18c0ad02984ac675

[2: 463f96b481]: 2018-01-26 19:53:09 -0500
  * doc/lispref/searching.texi: Document regexp repetition limit.
  https://git.savannah.gnu.org/cgit/emacs.git/commit/?id=463f96b4813fb77d88a7b0fa93f94aa08d71689f





^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2018-01-27  2:05 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-11-09 22:29 bug#24914: 24.5; isearch-regexp: wrong error message Drew Adams
2017-12-03 16:37 ` Noam Postavsky
2017-12-03 18:00   ` Drew Adams
2017-12-03 18:13     ` Noam Postavsky
2017-12-03 18:56       ` Drew Adams
2017-12-04  6:27         ` Noam Postavsky
2017-12-04 14:52           ` Drew Adams
2017-12-05  1:18             ` Noam Postavsky
2017-12-05  3:15               ` Drew Adams
2017-12-05  3:51                 ` Noam Postavsky
2017-12-05  4:52                   ` Drew Adams
2017-12-05 13:27                     ` Noam Postavsky
2017-12-05 15:31                       ` Drew Adams
2017-12-06  2:52                         ` Noam Postavsky
2017-12-08  9:48               ` Eli Zaretskii
2017-12-08 13:32                 ` Noam Postavsky
2017-12-08 14:35                   ` Eli Zaretskii
2017-12-10  2:18                     ` Noam Postavsky
2017-12-10  6:49                       ` Eli Zaretskii
2018-01-27  2:05                         ` Noam Postavsky
2017-12-04 15:18           ` Eli Zaretskii

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).