* bug#24914: 24.5; isearch-regexp: wrong error message @ 2016-11-09 22:29 Drew Adams 2017-12-03 16:37 ` Noam Postavsky 0 siblings, 1 reply; 21+ messages in thread From: Drew Adams @ 2016-11-09 22:29 UTC (permalink / raw) To: 24914 Do this in a large file: 1. C-M-s \(.\|^J\)+ That shows the error message: [error Stack overflow in regexp matcher]. OK, understandable. 2. C-M-s \(.\|^J\)\{,4000\}, where ^J is really a newline char, typed by using `C-q C-j'. No problem with that search. 3. C-M-s \(.\|^J\)\{,40000\} That shows the error message: [incomplete input], which is wrong, IMO. In GNU Emacs 24.5.1 (i686-pc-mingw32) of 2015-04-11 on LEG570 Windowing system distributor `Microsoft Corp.', version 6.1.7601 Configured using: `configure --prefix=/c/usr --host=i686-pc-mingw32' ^ permalink raw reply [flat|nested] 21+ messages in thread
* bug#24914: 24.5; isearch-regexp: wrong error message 2016-11-09 22:29 bug#24914: 24.5; isearch-regexp: wrong error message Drew Adams @ 2017-12-03 16:37 ` Noam Postavsky 2017-12-03 18:00 ` Drew Adams 0 siblings, 1 reply; 21+ messages in thread From: Noam Postavsky @ 2017-12-03 16:37 UTC (permalink / raw) To: Drew Adams; +Cc: 24914 [-- Attachment #1: Type: text/plain, Size: 500 bytes --] Drew Adams <drew.adams@oracle.com> writes: > 3. C-M-s \(.\|^J\)\{,40000\} > > That shows the error message: [incomplete input], which is wrong, IMO. The reason it doesn't work is because the number of repitions is limited to 32767 (#x7fff). Obviously that should be documented in the manual. As to the error message itself, there isn't really a way to distinguish between incomplete and invalid input, so the only thing I can see to do is to change the message to [incomplete or invalid input]. [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: patch --] [-- Type: text/x-diff, Size: 1697 bytes --] From f7bb728281408170cfe79005b03d2b382a84cdbd Mon Sep 17 00:00:00 2001 From: Noam Postavsky <npostavs@gmail.com> Date: Sat, 2 Dec 2017 19:01:54 -0500 Subject: [PATCH] Document limitation of regexp repetition (Bug#24914) * doc/lispref/searching.texi (Regexp Backslash): Explain that \{m,n\} may only use numbers up to 32767. * lisp/isearch.el (isearch-search): Update error message to include invalid input possibility. --- doc/lispref/searching.texi | 3 ++- lisp/isearch.el | 2 +- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi index 755fa554bb..92b7e6d17e 100644 --- a/doc/lispref/searching.texi +++ b/doc/lispref/searching.texi @@ -639,7 +639,8 @@ Regexp Backslash is a more general postfix operator that specifies repetition with a minimum of @var{m} repeats and a maximum of @var{n} repeats. If @var{m} is omitted, the minimum is 0; if @var{n} is omitted, there is no -maximum. +maximum. For both forms, @var{m} and @var{n}, if specified, may be no +larger than 32767. For example, @samp{c[ad]\@{1,2\@}r} matches the strings @samp{car}, @samp{cdr}, @samp{caar}, @samp{cadr}, @samp{cdar}, and @samp{cddr}, and diff --git a/lisp/isearch.el b/lisp/isearch.el index 13fa97ea71..dfc5f9f3f7 100644 --- a/lisp/isearch.el +++ b/lisp/isearch.el @@ -2853,7 +2853,7 @@ isearch-search ((string-match "\\`Premature \\|\\`Unmatched \\|\\`Invalid " isearch-error) - (setq isearch-error "incomplete input")) + (setq isearch-error "incomplete or invalid input")) ((and (not isearch-regexp) (string-match "\\`Regular expression too big" isearch-error)) (cond -- 2.11.0 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* bug#24914: 24.5; isearch-regexp: wrong error message 2017-12-03 16:37 ` Noam Postavsky @ 2017-12-03 18:00 ` Drew Adams 2017-12-03 18:13 ` Noam Postavsky 0 siblings, 1 reply; 21+ messages in thread From: Drew Adams @ 2017-12-03 18:00 UTC (permalink / raw) To: Noam Postavsky; +Cc: 24914 > > 3. C-M-s \(.\|^J\)\{,40000\} > > That shows the error message: [incomplete input], > > which is wrong, IMO. > > The reason it doesn't work is because the number of repitions > is limited to 32767 (#x7fff). Yet another case for adding bignums to Emacs Lisp? I imagine someone will answer that there needs to be a limit. In that case, can we not use something larger? Could we use the value of `most-positive-fixnum'? > Obviously that should be documented in the manual. Yes, please. > As to the error message itself, there isn't really a way > to distinguish between incomplete and invalid input, We do that in some places in the code. Some code parses the regexp, and that code must know (or be able to know) both that the regexp is not incomplete and that the numeral given for the number of repetitions is too large. > so the only thing I can see to do > is to change the message to [incomplete or invalid input]. I suppose that's better than [incomplete], but it doesn't really help users very much. Can we please do that but keep this bug open, hoping that someone will someday provide a real fix? ^ permalink raw reply [flat|nested] 21+ messages in thread
* bug#24914: 24.5; isearch-regexp: wrong error message 2017-12-03 18:00 ` Drew Adams @ 2017-12-03 18:13 ` Noam Postavsky 2017-12-03 18:56 ` Drew Adams 0 siblings, 1 reply; 21+ messages in thread From: Noam Postavsky @ 2017-12-03 18:13 UTC (permalink / raw) To: Drew Adams; +Cc: 24914 Drew Adams <drew.adams@oracle.com> writes: >> > 3. C-M-s \(.\|^J\)\{,40000\} >> > That shows the error message: [incomplete input], >> > which is wrong, IMO. >> >> The reason it doesn't work is because the number of repitions >> is limited to 32767 (#x7fff). > > Yet another case for adding bignums to Emacs Lisp? > I imagine someone will answer that there needs to be > a limit. > > In that case, can we not use something larger? > Could we use the value of `most-positive-fixnum'? It's not a limit in Lisp, but in regex.c. >> As to the error message itself, there isn't really a way >> to distinguish between incomplete and invalid input, > > We do that in some places in the code. What places are those? > Some code parses the regexp, and that code must know (or be able to > know) both that the regexp is not incomplete What does it mean for a regexp to be incomplete or not? As far as I can tell, the only distinction is that the user means to type more; but the code doesn't know what will happen in the future... > and that the numeral > given for the number of repetitions is too large. I suppose we could change regex.c to give a different error message for a repetition number that is too high, and then isearch.el could check for that specially. ^ permalink raw reply [flat|nested] 21+ messages in thread
* bug#24914: 24.5; isearch-regexp: wrong error message 2017-12-03 18:13 ` Noam Postavsky @ 2017-12-03 18:56 ` Drew Adams 2017-12-04 6:27 ` Noam Postavsky 0 siblings, 1 reply; 21+ messages in thread From: Drew Adams @ 2017-12-03 18:56 UTC (permalink / raw) To: Noam Postavsky; +Cc: 24914 > >> > 3. C-M-s \(.\|^J\)\{,40000\} > >> > That shows the error message: [incomplete input], > >> > which is wrong, IMO. > >> > >> The reason it doesn't work is because the number of repitions > >> is limited to 32767 (#x7fff). > > > > Yet another case for adding bignums to Emacs Lisp? > > I imagine someone will answer that there needs to be > > a limit. > > > > In that case, can we not use something larger? > > Could we use the value of `most-positive-fixnum'? > > It's not a limit in Lisp, but in regex.c. We can't use something larger there? > >> As to the error message itself, there isn't really a way > >> to distinguish between incomplete and invalid input, > > > > We do that in some places in the code. > > What places are those? In the Lisp code, at least, there are a few places where we provide an error that is specific to an invalid regexp. Search for handling of standard error `invalid-regexp', for instance. But if this is handled only in C code then you might want to look there instead. > > Some code parses the regexp, and that code must know (or be able to > > know) both that the regexp is not incomplete > > What does it mean for a regexp to be incomplete or not? As far as I can > tell, the only distinction is that the user means to type more; but the > code doesn't know what will happen in the future... Presumably that term is used only for cases where we can be sure that in order for the regexp to be valid there would need to be further input. `foo' is not incomplete, whether or not the user "means to type more". `[^' is incomplete, because it can be made valid only by typing more. > > and that the numeral > > given for the number of repetitions is too large. > > I suppose we could change regex.c to give a different error message for > a repetition number that is too high, and then isearch.el could check > for that specially. That would be great. ^ permalink raw reply [flat|nested] 21+ messages in thread
* bug#24914: 24.5; isearch-regexp: wrong error message 2017-12-03 18:56 ` Drew Adams @ 2017-12-04 6:27 ` Noam Postavsky 2017-12-04 14:52 ` Drew Adams 2017-12-04 15:18 ` Eli Zaretskii 0 siblings, 2 replies; 21+ messages in thread From: Noam Postavsky @ 2017-12-04 6:27 UTC (permalink / raw) To: Drew Adams; +Cc: 24914 Drew Adams <drew.adams@oracle.com> writes: >> It's not a limit in Lisp, but in regex.c. > > We can't use something larger there? Hmm, right, actually I see in regex.h: /* If sizeof(int) == 2, then ((1 << 15) - 1) overflows. */ #define RE_DUP_MAX (0x7fff) Does Emacs even support 16 bit platforms? >> >> As to the error message itself, there isn't really a way >> >> to distinguish between incomplete and invalid input, >> > >> > We do that in some places in the code. >> >> What places are those? > > In the Lisp code, at least, there are a few places where > we provide an error that is specific to an invalid regexp. > Search for handling of standard error `invalid-regexp', > for instance. As far as I can tell, none of those places (apart from isearch.el, the subject of this bug) try to flag "incomplete" regexps, only invalid or valid. >> > Some code parses the regexp, and that code must know (or be able to >> > know) both that the regexp is not incomplete >> >> What does it mean for a regexp to be incomplete or not? As far as I can >> tell, the only distinction is that the user means to type more; but the >> code doesn't know what will happen in the future... > > Presumably that term is used only for cases where we can > be sure that in order for the regexp to be valid there > would need to be further input. `foo' is not incomplete, > whether or not the user "means to type more". `[^' is > incomplete, because it can be made valid only by typing > more. Is `\\{100,20\\}' incomplete? Because it could be made valid by the user adding a 0 after the 20 to become '\\{100,200\\}'. Actually, I'm wondering what's the point of isearch showing "incomplete" instead of the actual regexp invalid error. I.e., why not instead of \ [incomplete] \{ [incomplete] \{4 [incomplete] \{4000 [incomplete] \{4000\ [incomplete] \{4000\} show this: \ [Trailing backslash] \{ [Unmatched \{] \{4 [Unmatched \{] \{4000 [Unmatched \{] \{4000\ [Trailing backslash] \{4000\} ^ permalink raw reply [flat|nested] 21+ messages in thread
* bug#24914: 24.5; isearch-regexp: wrong error message 2017-12-04 6:27 ` Noam Postavsky @ 2017-12-04 14:52 ` Drew Adams 2017-12-05 1:18 ` Noam Postavsky 2017-12-04 15:18 ` Eli Zaretskii 1 sibling, 1 reply; 21+ messages in thread From: Drew Adams @ 2017-12-04 14:52 UTC (permalink / raw) To: Noam Postavsky; +Cc: 24914 > >> >> As to the error message itself, there isn't really a way > >> >> to distinguish between incomplete and invalid input, > >> > > >> > We do that in some places in the code. > >> > >> What places are those? > > > > In the Lisp code, at least, there are a few places where > > we provide an error that is specific to an invalid regexp. > > Search for handling of standard error `invalid-regexp', > > for instance. > > As far as I can tell, none of those places (apart from isearch.el, the > subject of this bug) try to flag "incomplete" regexps, only invalid or > valid. Isn't that the point? In the case in question the regexp is not incomplete. It is "invalid" because the occurrences count is too high. Showing a message that says it is incomplete is wrong - that was the point of this report. What I cited are cases where we do flag _particular kinds_ of invalid regexps, and so tailor the error msg. That's what could be hoped for in the current case too: ideally, show a msg that says that the occurrences count is too high. If that can't be detected exactly then perhaps we can get close - e.g., invalid occurrences count or some such. > >> > Some code parses the regexp, and that code must know > >> > (or be able to know) both that the regexp is not incomplete > >> > >> What does it mean for a regexp to be incomplete or not? > >> As far as I can tell, the only distinction is that the > >> user means to type more; but the code doesn't know what > >> will happen in the future... > > > > Presumably that term is used only for cases where we can > > be sure that in order for the regexp to be valid there > > would need to be further input. `foo' is not incomplete, > > whether or not the user "means to type more". `[^' is > > incomplete, because it can be made valid only by typing > > more. > > Is `\\{100,20\\}' incomplete? Because it could be made valid > by the user adding a 0 after the 20 to become '\\{100,200\\}'. Of course, a user could always use `M-e' to edit the search pattern and type 0 before the \\}. But our isearch messages don't take that kind of thing into account. They assume the cursor is at the _end_ of the search pattern, so that further input is appended to the pattern. An incomplete-regexp message means (so far, aside from bugs like this one or perhaps cases where Emacs cannot do better) that we expect you to keep typing - at the end of the search pattern, to complete a valid regexp. > Actually, I'm wondering what's the point of isearch showing > "incomplete" instead of the actual regexp invalid error. > I.e., why not instead of > > \ [incomplete] > \{ [incomplete] > \{4 [incomplete] > \{4000 [incomplete] > \{4000\ [incomplete] > \{4000\} > > show this: > > \ [Trailing backslash] > \{ [Unmatched \{] > \{4 [Unmatched \{] > \{4000 [Unmatched \{] > \{4000\ [Trailing backslash] > \{4000\} Feel free to work on that. You might run into some cases that are not so clear-cut. But you might well improve things generally in some way. The problem with that is (I suppose) that it is not, in general, straightforward what would be needed to make the current pattern a valid regexp. In particular, there might be multiple ways to make it valid. Trying to describe what you're expecting, as possible appended input that would make for a valid regexp, would be hard. And doing it accurately, even when feasible, would lead to complex error msgs. It's maybe more user-friendly to just indicate that, so far, the regexp is not valid, but that it could become valid by appending something (i.e., without trying to accurately characterize that something). Anyway, unless working on that is needed or appropriate for fixing the reported bug, that should perhaps be dealt with by a separate bug (enhancement request). ^ permalink raw reply [flat|nested] 21+ messages in thread
* bug#24914: 24.5; isearch-regexp: wrong error message 2017-12-04 14:52 ` Drew Adams @ 2017-12-05 1:18 ` Noam Postavsky 2017-12-05 3:15 ` Drew Adams 2017-12-08 9:48 ` Eli Zaretskii 0 siblings, 2 replies; 21+ messages in thread From: Noam Postavsky @ 2017-12-05 1:18 UTC (permalink / raw) To: Drew Adams; +Cc: 24914 Drew Adams <drew.adams@oracle.com> writes: > What I cited are cases where we do flag _particular kinds_ > of invalid regexps, and so tailor the error msg. I'm not sure if you're citing actual code we have right now, or just some hypotheticals. In isearch.el, we pretty much do the opposite of tailor the error message. >> Actually, I'm wondering what's the point of isearch showing >> "incomplete" instead of the actual regexp invalid error. >> I.e., why not instead of >> >> \ [incomplete] >> \{ [incomplete] >> \{4 [incomplete] >> \{4000 [incomplete] >> \{4000\ [incomplete] >> \{4000\} >> >> show this: >> >> \ [Trailing backslash] >> \{ [Unmatched \{] >> \{4 [Unmatched \{] >> \{4000 [Unmatched \{] >> \{4000\ [Trailing backslash] >> \{4000\} > > Feel free to work on that. You might run into some cases > that are not so clear-cut. But you might well improve > things generally in some way. I meant just the following patch, you can try it out easily: --- i/lisp/isearch.el +++ w/lisp/isearch.el @@ -2850,10 +2850,6 @@ isearch-search (invalid-regexp (setq isearch-error (car (cdr lossage))) (cond - ((string-match - "\\`Premature \\|\\`Unmatched \\|\\`Invalid " - isearch-error) - (setq isearch-error "incomplete input")) ((and (not isearch-regexp) (string-match "\\`Regular expression too big" isearch-error)) (cond Eli Zaretskii <eliz@gnu.org> writes: >> >> /* If sizeof(int) == 2, then ((1 << 15) - 1) overflows. */ >> #define RE_DUP_MAX (0x7fff) >> >> Does Emacs even support 16 bit platforms? > > Emacs never did (the MS-DOS port of Emacs runs in i386 32-bit > protected mode on top of a 16-bit OS). But regex.c did, at some very > distant past, to support the 16-bit MS compiler, or at least it tried > to. So changing to 2^31 as the max should be fine, right? --- i/src/regex.h +++ w/src/regex.h @@ -270,8 +270,10 @@ #ifdef RE_DUP_MAX # undef RE_DUP_MAX #endif -/* If sizeof(int) == 2, then ((1 << 15) - 1) overflows. */ -#define RE_DUP_MAX (0x7fff) +/* If sizeof(int) == 4, then ((1 << 31) - 1) overflows. This used to + be limited to 0x7fff, but Emacs never supported 16 bit platforms + anyway. */ +#define RE_DUP_MAX (0x7fffffff) /* POSIX `cflags' bits (i.e., information for `regcomp'). */ ^ permalink raw reply [flat|nested] 21+ messages in thread
* bug#24914: 24.5; isearch-regexp: wrong error message 2017-12-05 1:18 ` Noam Postavsky @ 2017-12-05 3:15 ` Drew Adams 2017-12-05 3:51 ` Noam Postavsky 2017-12-08 9:48 ` Eli Zaretskii 1 sibling, 1 reply; 21+ messages in thread From: Drew Adams @ 2017-12-05 3:15 UTC (permalink / raw) To: Noam Postavsky; +Cc: 24914 > > What I cited are cases where we do flag _particular kinds_ > > of invalid regexps, and so tailor the error msg. > > I'm not sure if you're citing actual code we have right now, or just > some hypotheticals. In isearch.el, we pretty much do the opposite of > tailor the error message. I was citing what I thought were such cases in the current isearch.el code - cases where we do not just say "Invalid Regexp". We say things like this: Too many words Too many spaces for whitespace matching Unmatched [ or [^ Granted, the last is used only in `isearch-query-replace. My point was that in some existing cases (not many, admittedly), we do try to give a more precise error message when signal `invalid-regexp' is detected. But I'm not sure what you're arguing, if you are arguing. Certainly we don't tailor the message _much_ for the kind of `invalid-regexp'. But we do make some effort to do that, even now, AFAICT. > >> Actually, I'm wondering what's the point of isearch showing > >> "incomplete" instead of the actual regexp invalid error. > >> I.e., why not instead of > >> > >> \ [incomplete] > >> \{ [incomplete] > >> \{4 [incomplete] > >> \{4000 [incomplete] > >> \{4000\ [incomplete] > >> \{4000\} > >> > >> show this: > >> > >> \ [Trailing backslash] > >> \{ [Unmatched \{] > >> \{4 [Unmatched \{] > >> \{4000 [Unmatched \{] > >> \{4000\ [Trailing backslash] > >> \{4000\} > > I meant just the following patch, you can try it out easily: > (invalid-regexp > (setq isearch-error (car (cdr lossage))) > (cond > - ((string-match > - "\\`Premature \\|\\`Unmatched \\|\\`Invalid " > - isearch-error) > - (setq isearch-error "incomplete input")) > ((and (not isearch-regexp) > (string-match "\\`Regular expression too big" isearch-error)) > (cond You mean show "[Invalid content of \{\}]" in all cases? _Never_ show "[incomplete input]"? Why would that be better? Anyway, I don't have a strong opinion about that. I do think that in the case reported it's too bad that we say "[incomplete input]". But I don't think it follows that it would be more helpful to most users to show the invalid-regexp description in cases where Emacs can really tell that the input is necessarily incomplete. I suspect that it is quite common for that "incomplete input" message to be helpful. ^ permalink raw reply [flat|nested] 21+ messages in thread
* bug#24914: 24.5; isearch-regexp: wrong error message 2017-12-05 3:15 ` Drew Adams @ 2017-12-05 3:51 ` Noam Postavsky 2017-12-05 4:52 ` Drew Adams 0 siblings, 1 reply; 21+ messages in thread From: Noam Postavsky @ 2017-12-05 3:51 UTC (permalink / raw) To: Drew Adams; +Cc: 24914 Drew Adams <drew.adams@oracle.com> writes: > I was citing what I thought were such cases in the current > isearch.el code - cases where we do not just say "Invalid > Regexp". We say things like this: > > Too many words > Too many spaces for whitespace matching In that case, the regexp is constructed by Emacs on behalf of the user, so it makes sense to translate the error message so that it matches the user operation. > Unmatched [ or [^ AFAICT, the isearch.el code doesn't write that message, but rather reads it. > Granted, the last is used only in `isearch-query-replace. > > My point was that in some existing cases (not many, > admittedly), we do try to give a more precise error > message when signal `invalid-regexp' is detected. > > But I'm not sure what you're arguing, if you are arguing. The C regex code produces the following error messages when parsing regexps: { gettext_noop ("Success"), /* REG_NOERROR */ gettext_noop ("No match"), /* REG_NOMATCH */ gettext_noop ("Invalid regular expression"), /* REG_BADPAT */ gettext_noop ("Invalid collation character"), /* REG_ECOLLATE */ gettext_noop ("Invalid character class name"), /* REG_ECTYPE */ gettext_noop ("Trailing backslash"), /* REG_EESCAPE */ gettext_noop ("Invalid back reference"), /* REG_ESUBREG */ gettext_noop ("Unmatched [ or [^"), /* REG_EBRACK */ gettext_noop ("Unmatched ( or \\("), /* REG_EPAREN */ gettext_noop ("Unmatched \\{"), /* REG_EBRACE */ gettext_noop ("Invalid content of \\{\\}"), /* REG_BADBR */ gettext_noop ("Invalid range end"), /* REG_ERANGE */ gettext_noop ("Memory exhausted"), /* REG_ESPACE */ gettext_noop ("Invalid preceding regular expression"), /* REG_BADRPT */ gettext_noop ("Premature end of regular expression"), /* REG_EEND */ gettext_noop ("Regular expression too big"), /* REG_ESIZE */ gettext_noop ("Unmatched ) or \\)"), /* REG_ERPAREN */ gettext_noop ("Range striding over charsets"), /* REG_ERANGEX */ }; When doing isearch-*-regexp, most of those errors are converted into "incomplete" (i.e., *less* precise). But I think it would be more helpful to show the original error message. > But I don't think it follows that it would be more helpful to > most users to show the invalid-regexp description in cases > where Emacs can really tell that the input is necessarily > incomplete. I suspect that it is quite common for that > "incomplete input" message to be helpful. How does it help (compared to the more precise message)? Seems to me that telling the user they haven't finished entering the regexp isn't especially helpful; surely the user already knows they haven't finished typing yet. ^ permalink raw reply [flat|nested] 21+ messages in thread
* bug#24914: 24.5; isearch-regexp: wrong error message 2017-12-05 3:51 ` Noam Postavsky @ 2017-12-05 4:52 ` Drew Adams 2017-12-05 13:27 ` Noam Postavsky 0 siblings, 1 reply; 21+ messages in thread From: Drew Adams @ 2017-12-05 4:52 UTC (permalink / raw) To: Noam Postavsky; +Cc: 24914 > gettext_noop ("Success"), /* REG_NOERROR */ > gettext_noop ("No match"), /* REG_NOMATCH */ > gettext_noop ("Invalid regular expression"), /* REG_BADPAT */ > gettext_noop ("Invalid collation character"), /* REG_ECOLLATE */ > gettext_noop ("Invalid character class name"), /* REG_ECTYPE */ > gettext_noop ("Trailing backslash"), /* REG_EESCAPE */ > gettext_noop ("Invalid back reference"), /* REG_ESUBREG */ > gettext_noop ("Unmatched [ or [^"), /* REG_EBRACK */ > gettext_noop ("Unmatched ( or \\("), /* REG_EPAREN */ > gettext_noop ("Unmatched \\{"), /* REG_EBRACE */ > gettext_noop ("Invalid content of \\{\\}"), /* REG_BADBR */ > gettext_noop ("Invalid range end"), /* REG_ERANGE */ > gettext_noop ("Memory exhausted"), /* REG_ESPACE */ > gettext_noop ("Invalid preceding regular expression"), /* REG_BADRPT > */ > gettext_noop ("Premature end of regular expression"), /* REG_EEND */ > gettext_noop ("Regular expression too big"), /* REG_ESIZE */ > gettext_noop ("Unmatched ) or \\)"), /* REG_ERPAREN */ > gettext_noop ("Range striding over charsets"), /* REG_ERANGEX */ > > When doing isearch-*-regexp, most of those errors are converted into > "incomplete" (i.e., *less* precise). But I think it would be more > helpful to show the original error message. Agreed. Unless Emacs can be sure that in some context one of them really means that the regexp is not complete. If the user can just keep typing to make a valid regexp then an error msg (premature, not yet warranted) typically hurts more than it helps, I think. But if Emacs can't tell that, than sure, why not? Timing can mean a lot also (but depends on the user and how much thinking is going on). It's not great to interrupt immediately with an error msg if the user was still typing. And clearly some of those error conditions do _not_ end up translated as "incomplete input" messages - or they should not, in any case. Clearly someone made a decision about "Trailing backslash", for instance, and it's a very good decision IMO. That's a more useful "the-pattern-is-incomplete" message than just saying "incomplete input". We are not the first to consider the list of error conditions and what to do about this one or that one. That doesn't imply that the judgment made previously is the best one. It does suggest perhaps consulting those who might have made it, or the larger emacs-devel community. The behavior could be completely one-sided one way or the other way. Or it could be any kind of mix in between. It is currently a mix, but obviously not a perfect one - hence this bug report. Which tradeoff is best? > > But I don't think it follows that it would be more helpful to > > most users to show the invalid-regexp description in cases > > where Emacs can really tell that the input is necessarily > > incomplete. I suspect that it is quite common for that > > "incomplete input" message to be helpful. > > How does it help (compared to the more precise message)? See above. Isearch is incremental: you don't just enter a complete regexp once and for all (as in `grep', for instance. If Emacs jumps the gun with a premature "error" msg, that can be annoying, no? > Seems to me > that telling the user they haven't finished entering the regexp isn't > especially helpful; surely the user already knows they haven't finished > typing yet. No, _not_ surely - I think. Many (most? maybe, maybe not) users are not all that positive about using Emacs regexps. They should be able to interact with Isearch regexps interactively, incrementally. And yes, I think that it can be helpful to let a user know that the current pattern is incomplete as a regexp. But hey, users are different. I'd argue that we could add an option, with the default setting the current behavior (as I expect it is those less experienced that the "incomplete" behavior could benefit the most, and those more experience who could benefit most from the specific "invalid" messages). The latter are probably also the ones most likely to try more complicated regexps, which benefit the most, I expect, from precise "invalid" messages. Adding the behavior you propose as an option shouldn't hurt. But even for that you might want to propose it at emacs-devel. There might be people there more familiar with different use cases or who know more about the history of why the current behavior is as it is. Just a suggestion. ^ permalink raw reply [flat|nested] 21+ messages in thread
* bug#24914: 24.5; isearch-regexp: wrong error message 2017-12-05 4:52 ` Drew Adams @ 2017-12-05 13:27 ` Noam Postavsky 2017-12-05 15:31 ` Drew Adams 0 siblings, 1 reply; 21+ messages in thread From: Noam Postavsky @ 2017-12-05 13:27 UTC (permalink / raw) To: Drew Adams; +Cc: 24914 Drew Adams <drew.adams@oracle.com> writes: > Clearly someone made a decision about "Trailing backslash", > for instance, and it's a very good decision IMO. That's a > more useful "the-pattern-is-incomplete" message than just > saying "incomplete input". Right, but I can't see why the same shouldn't apply to "Unmatched \\{" and all the others. > We are not the first to consider the list of error > conditions and what to do about this one or that one. > That doesn't imply that the judgment made previously is > the best one. It does suggest perhaps consulting those > who might have made it, or the larger emacs-devel community. That code seems to have been there since 1992 "Initial revision", so it's not clear what, if any, considerations were made. > See above. Isearch is incremental: you don't just enter > a complete regexp once and for all (as in `grep', for > instance. If Emacs jumps the gun with a premature "error" > msg, that can be annoying, no? We already get an "error" message, it says "incomplete". The question is why shouldn't it instead say *why* it's incomplete. > Adding the behavior you propose as an option shouldn't > hurt. It hurts, because it adds yet another option, which makes a user's job of going through them and deciding which ones make sense that much harder (yes, just this particular addded option won't make much difference, but still). >There might be people there more familiar with different use cases or >who know more about the history of why the current behavior is as it >is. I hope so. ^ permalink raw reply [flat|nested] 21+ messages in thread
* bug#24914: 24.5; isearch-regexp: wrong error message 2017-12-05 13:27 ` Noam Postavsky @ 2017-12-05 15:31 ` Drew Adams 2017-12-06 2:52 ` Noam Postavsky 0 siblings, 1 reply; 21+ messages in thread From: Drew Adams @ 2017-12-05 15:31 UTC (permalink / raw) To: Noam Postavsky; +Cc: 24914 > > Clearly someone made a decision about "Trailing backslash", > > for instance, and it's a very good decision IMO. That's a > > more useful "the-pattern-is-incomplete" message than just > > saying "incomplete input". > > Right, but I can't see why the same shouldn't apply to > "Unmatched \\{" and all the others. Treating them all the same is one possibility. Not the best one, I think, but one possibility. > > We are not the first to consider the list of error > > conditions and what to do about this one or that one. > > That doesn't imply that the judgment made previously is > > the best one. It does suggest perhaps consulting those > > who might have made it, or the larger emacs-devel community. > > That code seems to have been there since 1992 "Initial > revision", so it's not clear what, if any, considerations > were made. It might not be clear, but that doesn't mean there weren't good reasons for that judgment. (And no, I'm not saying that a different judgment can't be made now.) And certainly some of those around then, including deciders probably, are still around today. Perhaps RMS has an opinion or recollection about this? > > See above. Isearch is incremental: you don't just enter > > a complete regexp once and for all (as in `grep', for > > instance. If Emacs jumps the gun with a premature "error" > > msg, that can be annoying, no? > > We already get an "error" message, it says "incomplete". > The question is why shouldn't it instead say *why* it's > incomplete. I thought your proposal was to, in all cases, eliminate saying it is incomplete in favor of the specific regexp-invalidity error text. Such error text does not, generally and directly, tell users that the input is incomplete. Users very familiar with regexps might understand that such a msg implies that input is incomplete, but not everyone will get that. > > Adding the behavior you propose as an option shouldn't > > hurt. > > It hurts, because it adds yet another option, which makes a user's job > of going through them and deciding which ones make sense that much > harder (yes, just this particular addded option won't make much > difference, but still). Users who are very familiar with Emacs regexps will be the ones to benefit most from the specific regexp-validity msgs, IMO. They should have no problem customizing an option. Users unfamiliar with Emacs or Emacs regexps will get the simpler default behavior: your input pattern is not yet complete. If you feel strongly about this and are opposed to adding a user option, consider proposing the change you want to emacs-devel. This particular bug is about one case: just the particular "incomplete input" message case cited. Fixing this bug shouldn't require changing the basic behavior, though that is certainly one possibility. You apparently think there is never any value in telling users that the input pattern is not complete as a regexp. I disagree. We apparently agree that at least in some cases the specific regexp-invalidity message is more helpful. There is a spectrum of possibilities here. I see no special reason why the right approach should be all or none. > > There might be people there more familiar with different > > use cases or who know more about the history of why the > > current behavior is as it is. > > I hope so. ^ permalink raw reply [flat|nested] 21+ messages in thread
* bug#24914: 24.5; isearch-regexp: wrong error message 2017-12-05 15:31 ` Drew Adams @ 2017-12-06 2:52 ` Noam Postavsky 0 siblings, 0 replies; 21+ messages in thread From: Noam Postavsky @ 2017-12-06 2:52 UTC (permalink / raw) To: Drew Adams; +Cc: 24914 [-- Attachment #1: Type: text/plain, Size: 857 bytes --] tags 24914 + patch quit Drew Adams <drew.adams@oracle.com> writes: > Such error text does not, generally and directly, tell > users that the input is incomplete. Users very familiar > with regexps might understand that such a msg implies > that input is incomplete, but not everyone will get that. Hmm, I hadn't considered that possibility, but I will allow that *could* be a symptom of my being overly familiar with regexp syntax. > You apparently think there is never any value in > telling users that the input pattern is not > complete as a regexp. I disagree. We apparently > agree that at least in some cases the specific > regexp-invalidity message is more helpful. Okay, I've looked at the error messages a bit more closely, and I believe all the "Invalid ..." ones should never be considered "incomplete". See commit message for details. [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: patch --] [-- Type: text/x-diff, Size: 5656 bytes --] From 1d32f4d28521a143c333ef4cc125419661e3a3a9 Mon Sep 17 00:00:00 2001 From: Noam Postavsky <npostavs@gmail.com> Date: Sat, 2 Dec 2017 19:01:54 -0500 Subject: [PATCH] Raise limit of regexp repetition (Bug#24914) * src/regex.h (RE_DUP_MAX): Raise limit to 2^32-1. * etc/NEWS: Announce it. * doc/lispref/searching.texi (Regexp Backslash): Document it. * src/regex.h (reg_errcode_t): Add REG_ESIZEBR code. * src/regex.c (re_error_msgid): Add corresponding entry. (GET_INTERVAL_COUNT): Return it instead of the more generic REG_EBADBR when encountering a repetition greater than RE_DUP_MAX. * lisp/isearch.el (isearch-search): Don't convert errors starting with "Invalid" into "incomplete". Such errors are not incomplete, in the sense that they cannot be corrected by appending more characters to the end of the regexp. The affected error messages are: - REG_BADPAT "Invalid regular expression" - \\(?X:\\) where X is not a legal group number - \\_X where X is not < or > - REG_ECOLLATE "Invalid collation character" - There is no code to throw this. - REG_ECTYPE "Invalid character class name" - [[:foo:] where foo is not a valid class name - REG_ESUBREG "Invalid back reference" - \N where N is referenced before matching group N - REG_BADBR "Invalid content of \\{\\}" - \\{N,M\\} where N < 0, M < N, M or N larger than max - \\{NX where X is not a digit or backslash - \\{N\\X where X is not a } - REG_ERANGE "Invalid range end" - There is no code to throw this. - REG_BADRPT "Invalid preceding regular expression" - We never throw this. It would usually indicate a "*" with no preceding regexp text, but Emacs allows that to match a literal "*". --- doc/lispref/searching.texi | 9 ++++++++- etc/NEWS | 8 ++++++++ lisp/isearch.el | 2 +- src/regex.c | 5 +++-- src/regex.h | 9 ++++++--- 5 files changed, 26 insertions(+), 7 deletions(-) diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi index 755fa554bb..724d66b5e3 100644 --- a/doc/lispref/searching.texi +++ b/doc/lispref/searching.texi @@ -639,7 +639,14 @@ Regexp Backslash is a more general postfix operator that specifies repetition with a minimum of @var{m} repeats and a maximum of @var{n} repeats. If @var{m} is omitted, the minimum is 0; if @var{n} is omitted, there is no -maximum. +maximum. For both forms, @var{m} and @var{n}, if specified, may be no +larger than +@ifnottex +2**31 @minus{} 1 +@end ifnottex +@tex +@math{2^{31}-1} +@end tex For example, @samp{c[ad]\@{1,2\@}r} matches the strings @samp{car}, @samp{cdr}, @samp{caar}, @samp{cadr}, @samp{cdar}, and @samp{cddr}, and diff --git a/etc/NEWS b/etc/NEWS index 4ccf468693..579cad058e 100644 --- a/etc/NEWS +++ b/etc/NEWS @@ -509,6 +509,14 @@ instead. ** The new user option 'arabic-shaper-ZWNJ-handling' controls how to handle ZWNJ in Arabic text rendering. ++++ +** The limit on repetitions in regexps has been raised to 2^31-1. +It was previously undocumented and limited to 2^15-1. For example, +the following regular expression was previously invalid, but is now +accepted: + + x\{32768\} + \f * Editing Changes in Emacs 26.1 diff --git a/lisp/isearch.el b/lisp/isearch.el index 13fa97ea71..093185a096 100644 --- a/lisp/isearch.el +++ b/lisp/isearch.el @@ -2851,7 +2851,7 @@ isearch-search (setq isearch-error (car (cdr lossage))) (cond ((string-match - "\\`Premature \\|\\`Unmatched \\|\\`Invalid " + "\\`Premature \\|\\`Unmatched " isearch-error) (setq isearch-error "incomplete input")) ((and (not isearch-regexp) diff --git a/src/regex.c b/src/regex.c index 330f2f78a8..ab74f457d4 100644 --- a/src/regex.c +++ b/src/regex.c @@ -1200,7 +1200,8 @@ WEAK_ALIAS (__re_set_syntax, re_set_syntax) gettext_noop ("Premature end of regular expression"), /* REG_EEND */ gettext_noop ("Regular expression too big"), /* REG_ESIZE */ gettext_noop ("Unmatched ) or \\)"), /* REG_ERPAREN */ - gettext_noop ("Range striding over charsets") /* REG_ERANGEX */ + gettext_noop ("Range striding over charsets"), /* REG_ERANGEX */ + gettext_noop ("Invalid content of \\{\\}, repetitions too big") /* REG_ESIZEBR */ }; \f /* Whether to allocate memory during matching. */ @@ -1921,7 +1922,7 @@ while (REMAINING_AVAIL_SLOTS <= space) { \ if (num < 0) \ num = 0; \ if (RE_DUP_MAX / 10 - (RE_DUP_MAX % 10 < c - '0') < num) \ - FREE_STACK_RETURN (REG_BADBR); \ + FREE_STACK_RETURN (REG_ESIZEBR); \ num = num * 10 + c - '0'; \ if (p == pend) \ FREE_STACK_RETURN (REG_EBRACE); \ diff --git a/src/regex.h b/src/regex.h index 9fa8356011..b829848586 100644 --- a/src/regex.h +++ b/src/regex.h @@ -270,8 +270,10 @@ #ifdef RE_DUP_MAX # undef RE_DUP_MAX #endif -/* If sizeof(int) == 2, then ((1 << 15) - 1) overflows. */ -#define RE_DUP_MAX (0x7fff) +/* If sizeof(int) == 4, then ((1 << 31) - 1) overflows. This used to + be limited to 0x7fff, but Emacs never supported 16 bit platforms + anyway. */ +#define RE_DUP_MAX (0x7fffffff) /* POSIX `cflags' bits (i.e., information for `regcomp'). */ @@ -337,7 +339,8 @@ REG_EEND, /* Premature end. */ REG_ESIZE, /* Compiled pattern bigger than 2^16 bytes. */ REG_ERPAREN, /* Unmatched ) or \); not returned from regcomp. */ - REG_ERANGEX /* Range striding over charsets. */ + REG_ERANGEX, /* Range striding over charsets. */ + REG_ESIZEBR /* n or m too big in \{n,m\} */ } reg_errcode_t; \f /* This data structure represents a compiled pattern. Before calling -- 2.11.0 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* bug#24914: 24.5; isearch-regexp: wrong error message 2017-12-05 1:18 ` Noam Postavsky 2017-12-05 3:15 ` Drew Adams @ 2017-12-08 9:48 ` Eli Zaretskii 2017-12-08 13:32 ` Noam Postavsky 1 sibling, 1 reply; 21+ messages in thread From: Eli Zaretskii @ 2017-12-08 9:48 UTC (permalink / raw) To: Noam Postavsky; +Cc: 24914 > From: Noam Postavsky <npostavs@users.sourceforge.net> > Date: Mon, 04 Dec 2017 20:18:02 -0500 > Cc: 24914@debbugs.gnu.org > > > Emacs never did (the MS-DOS port of Emacs runs in i386 32-bit > > protected mode on top of a 16-bit OS). But regex.c did, at some very > > distant past, to support the 16-bit MS compiler, or at least it tried > > to. > > So changing to 2^31 as the max should be fine, right? Or maybe just INT_MAX? ^ permalink raw reply [flat|nested] 21+ messages in thread
* bug#24914: 24.5; isearch-regexp: wrong error message 2017-12-08 9:48 ` Eli Zaretskii @ 2017-12-08 13:32 ` Noam Postavsky 2017-12-08 14:35 ` Eli Zaretskii 0 siblings, 1 reply; 21+ messages in thread From: Noam Postavsky @ 2017-12-08 13:32 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 24914 Eli Zaretskii <eliz@gnu.org> writes: >> From: Noam Postavsky <npostavs@users.sourceforge.net> >> Date: Mon, 04 Dec 2017 20:18:02 -0500 >> Cc: 24914@debbugs.gnu.org >> >> > Emacs never did (the MS-DOS port of Emacs runs in i386 32-bit >> > protected mode on top of a 16-bit OS). But regex.c did, at some very >> > distant past, to support the 16-bit MS compiler, or at least it tried >> > to. >> >> So changing to 2^31 as the max should be fine, right? > > Or maybe just INT_MAX? I thought it would be easier to document the limit if it's fixed across all machines. Otherwise we would have to say something like "For both forms, m and n, if specified, may be no larger than INT_MAX, which is usually 2**31 - 1, but could be 2**63 - 1 depending on the compiler used for building Emacs". ^ permalink raw reply [flat|nested] 21+ messages in thread
* bug#24914: 24.5; isearch-regexp: wrong error message 2017-12-08 13:32 ` Noam Postavsky @ 2017-12-08 14:35 ` Eli Zaretskii 2017-12-10 2:18 ` Noam Postavsky 0 siblings, 1 reply; 21+ messages in thread From: Eli Zaretskii @ 2017-12-08 14:35 UTC (permalink / raw) To: Noam Postavsky; +Cc: 24914 > From: Noam Postavsky <npostavs@users.sourceforge.net> > Cc: drew.adams@oracle.com, 24914@debbugs.gnu.org > Date: Fri, 08 Dec 2017 08:32:42 -0500 > > >> So changing to 2^31 as the max should be fine, right? > > > > Or maybe just INT_MAX? > > I thought it would be easier to document the limit if it's fixed across > all machines. Otherwise we would have to say something like "For both > forms, m and n, if specified, may be no larger than INT_MAX, which is > usually 2**31 - 1, but could be 2**63 - 1 depending on the compiler used > for building Emacs". Isn't int 32 bit wide everywhere? And anyway, since the bitmap is stored in an int, isn't INT_MAX TRT? But I'm not a language expert enough to argue; feel free to use your form if you think it's better. ^ permalink raw reply [flat|nested] 21+ messages in thread
* bug#24914: 24.5; isearch-regexp: wrong error message 2017-12-08 14:35 ` Eli Zaretskii @ 2017-12-10 2:18 ` Noam Postavsky 2017-12-10 6:49 ` Eli Zaretskii 0 siblings, 1 reply; 21+ messages in thread From: Noam Postavsky @ 2017-12-10 2:18 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 24914 [-- Attachment #1: Type: text/plain, Size: 2773 bytes --] Eli Zaretskii <eliz@gnu.org> writes: >> I thought it would be easier to document the limit if it's fixed across >> all machines. Otherwise we would have to say something like "For both >> forms, m and n, if specified, may be no larger than INT_MAX, which is >> usually 2**31 - 1, but could be 2**63 - 1 depending on the compiler used >> for building Emacs". > > Isn't int 32 bit wide everywhere? I might have been mixing up int with long when I was thinking about this; it seems only a few very obscure platforms have 64 bit ints. According to [1], everywhere but "HAL Computer Systems port of Solaris to the SPARC64" and "Classic UNICOS" has 32 bit ints. [1]: https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_models > And anyway, since the bitmap is stored in an int, isn't INT_MAX TRT? Unfortunately, all this discussion of int size seems to be academic. I took another look at the code, there is another limit due to regexp opcode format. We can raise the limit to 2^16-1 though. Here is the use of RE_DUP_MAX, which makes it seem like int-size is the main limit: /* Get the next unsigned number in the uncompiled pattern. */ #define GET_INTERVAL_COUNT(num) \ ... if (RE_DUP_MAX / 10 - (RE_DUP_MAX % 10 < c - '0') < num) \ FREE_STACK_RETURN (REG_ESIZEBR); \ static reg_errcode_t regex_compile (const_re_char *pattern, size_t size, { ... int lower_bound = 0, upper_bound = -1; [...] GET_INTERVAL_COUNT (lower_bound); But then INSERT_JUMP2 (succeed_n, laststart, b + 5 + nbytes, lower_bound); /* Like `STORE_JUMP2', but for inserting. Assume `b' is the buffer end. */ #define INSERT_JUMP2(op, loc, to, arg) \ insert_op2 (op, loc, (to) - (loc) - 3, arg, b) /* Like `insert_op1', but for two two-byte parameters ARG1 and ARG2. */ ^^^^^^^^ static void insert_op2 (re_opcode_t op, unsigned char *loc, int arg1, int arg2, unsigned char *end) { ... store_op2 (op, loc, arg1, arg2); } /* Like `store_op1', but for two two-byte parameters ARG1 and ARG2. */ ^^^^^^^^ static void store_op2 (re_opcode_t op, unsigned char *loc, int arg1, int arg2) { *loc = (unsigned char) op; STORE_NUMBER (loc + 1, arg1); STORE_NUMBER (loc + 3, arg2); } /* Store NUMBER in two contiguous bytes starting at DESTINATION. */ ^^^^^^^^^^^^^^^^^^^^ #define STORE_NUMBER(destination, number) \ do { \ (destination)[0] = (number) & 0377; \ (destination)[1] = (number) >> 8; \ } while (0) Here is the updated patch: [-- Attachment #2: patch --] [-- Type: text/plain, Size: 6410 bytes --] From 6c3ead6bd5c61801915dcedbb8dd17622610a899 Mon Sep 17 00:00:00 2001 From: Noam Postavsky <npostavs@gmail.com> Date: Sat, 2 Dec 2017 19:01:54 -0500 Subject: [PATCH] Raise limit of regexp repetition (Bug#24914) * src/regex.h (RE_DUP_MAX): Raise limit to 2^16-1. * etc/NEWS: Announce it. * doc/lispref/searching.texi (Regexp Backslash): Document it. * test/src/regex-tests.el (regex-repeat-limit): Test it. * src/regex.h (reg_errcode_t): Add REG_ESIZEBR code. * src/regex.c (re_error_msgid): Add corresponding entry. (GET_INTERVAL_COUNT): Return it instead of the more generic REG_EBADBR when encountering a repetition greater than RE_DUP_MAX. * lisp/isearch.el (isearch-search): Don't convert errors starting with "Invalid" into "incomplete". Such errors are not incomplete, in the sense that they cannot be corrected by appending more characters to the end of the regexp. The affected error messages are: - REG_BADPAT "Invalid regular expression" - \\(?X:\\) where X is not a legal group number - \\_X where X is not < or > - REG_ECOLLATE "Invalid collation character" - There is no code to throw this. - REG_ECTYPE "Invalid character class name" - [[:foo:] where foo is not a valid class name - REG_ESUBREG "Invalid back reference" - \N where N is referenced before matching group N - REG_BADBR "Invalid content of \\{\\}" - \\{N,M\\} where N < 0, M < N, M or N larger than max - \\{NX where X is not a digit or backslash - \\{N\\X where X is not a } - REG_ERANGE "Invalid range end" - There is no code to throw this. - REG_BADRPT "Invalid preceding regular expression" - We never throw this. It would usually indicate a "*" with no preceding regexp text, but Emacs allows that to match a literal "*". --- doc/lispref/searching.texi | 10 +++++++++- etc/NEWS | 8 ++++++++ lisp/isearch.el | 2 +- src/regex.c | 5 +++-- src/regex.h | 9 ++++++--- test/src/regex-tests.el | 6 ++++++ 6 files changed, 33 insertions(+), 7 deletions(-) diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi index 755fa554bb..ab52cf2802 100644 --- a/doc/lispref/searching.texi +++ b/doc/lispref/searching.texi @@ -639,7 +639,15 @@ Regexp Backslash is a more general postfix operator that specifies repetition with a minimum of @var{m} repeats and a maximum of @var{n} repeats. If @var{m} is omitted, the minimum is 0; if @var{n} is omitted, there is no -maximum. +maximum. For both forms, @var{m} and @var{n}, if specified, may be no +larger than +@ifnottex +2**16 @minus{} 1 +@end ifnottex +@tex +@math{2^{16}-1} +@end tex +. For example, @samp{c[ad]\@{1,2\@}r} matches the strings @samp{car}, @samp{cdr}, @samp{caar}, @samp{cadr}, @samp{cdar}, and @samp{cddr}, and diff --git a/etc/NEWS b/etc/NEWS index 64b53d88c8..c7efc53f6a 100644 --- a/etc/NEWS +++ b/etc/NEWS @@ -509,6 +509,14 @@ instead. ** The new user option 'arabic-shaper-ZWNJ-handling' controls how to handle ZWNJ in Arabic text rendering. ++++ +** The limit on repetitions in regexps has been raised to 2^16-1. +It was previously undocumented and limited to 2^15-1. For example, +the following regular expression was previously invalid, but is now +accepted: + + x\{32768\} + \f * Editing Changes in Emacs 26.1 diff --git a/lisp/isearch.el b/lisp/isearch.el index 13fa97ea71..093185a096 100644 --- a/lisp/isearch.el +++ b/lisp/isearch.el @@ -2851,7 +2851,7 @@ isearch-search (setq isearch-error (car (cdr lossage))) (cond ((string-match - "\\`Premature \\|\\`Unmatched \\|\\`Invalid " + "\\`Premature \\|\\`Unmatched " isearch-error) (setq isearch-error "incomplete input")) ((and (not isearch-regexp) diff --git a/src/regex.c b/src/regex.c index 330f2f78a8..ab74f457d4 100644 --- a/src/regex.c +++ b/src/regex.c @@ -1200,7 +1200,8 @@ WEAK_ALIAS (__re_set_syntax, re_set_syntax) gettext_noop ("Premature end of regular expression"), /* REG_EEND */ gettext_noop ("Regular expression too big"), /* REG_ESIZE */ gettext_noop ("Unmatched ) or \\)"), /* REG_ERPAREN */ - gettext_noop ("Range striding over charsets") /* REG_ERANGEX */ + gettext_noop ("Range striding over charsets"), /* REG_ERANGEX */ + gettext_noop ("Invalid content of \\{\\}, repetitions too big") /* REG_ESIZEBR */ }; \f /* Whether to allocate memory during matching. */ @@ -1921,7 +1922,7 @@ while (REMAINING_AVAIL_SLOTS <= space) { \ if (num < 0) \ num = 0; \ if (RE_DUP_MAX / 10 - (RE_DUP_MAX % 10 < c - '0') < num) \ - FREE_STACK_RETURN (REG_BADBR); \ + FREE_STACK_RETURN (REG_ESIZEBR); \ num = num * 10 + c - '0'; \ if (p == pend) \ FREE_STACK_RETURN (REG_EBRACE); \ diff --git a/src/regex.h b/src/regex.h index 9fa8356011..4c8632d6aa 100644 --- a/src/regex.h +++ b/src/regex.h @@ -270,8 +270,10 @@ #ifdef RE_DUP_MAX # undef RE_DUP_MAX #endif -/* If sizeof(int) == 2, then ((1 << 15) - 1) overflows. */ -#define RE_DUP_MAX (0x7fff) +/* Repeat counts are stored in opcodes as 2 byte integers. This was + previously limited to 7fff because the parsing code uses signed + ints. But Emacs only runs on 32 bit platforms anyway. */ +#define RE_DUP_MAX (0xffff) /* POSIX `cflags' bits (i.e., information for `regcomp'). */ @@ -337,7 +339,8 @@ REG_EEND, /* Premature end. */ REG_ESIZE, /* Compiled pattern bigger than 2^16 bytes. */ REG_ERPAREN, /* Unmatched ) or \); not returned from regcomp. */ - REG_ERANGEX /* Range striding over charsets. */ + REG_ERANGEX, /* Range striding over charsets. */ + REG_ESIZEBR /* n or m too big in \{n,m\} */ } reg_errcode_t; \f /* This data structure represents a compiled pattern. Before calling diff --git a/test/src/regex-tests.el b/test/src/regex-tests.el index b1f1ea71ce..872d16a085 100644 --- a/test/src/regex-tests.el +++ b/test/src/regex-tests.el @@ -677,4 +677,10 @@ regex-tests-TESTS This evaluates the TESTS test cases from glibc." (should-not (regex-tests-TESTS))) +(ert-deftest regex-repeat-limit () + "Test the #xFFFF repeat limit." + (should (string-match "\\`x\\{65535\\}" (make-string 65535 ?x))) + (should-not (string-match "\\`x\\{65535\\}" (make-string 65534 ?x))) + (should-error (string-match "\\`x\\{65536\\}" "X") :type invalid-regexp)) + ;;; regex-tests.el ends here -- 2.11.0 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* bug#24914: 24.5; isearch-regexp: wrong error message 2017-12-10 2:18 ` Noam Postavsky @ 2017-12-10 6:49 ` Eli Zaretskii 2018-01-27 2:05 ` Noam Postavsky 0 siblings, 1 reply; 21+ messages in thread From: Eli Zaretskii @ 2017-12-10 6:49 UTC (permalink / raw) To: Noam Postavsky; +Cc: 24914 > From: Noam Postavsky <npostavs@users.sourceforge.net> > Cc: drew.adams@oracle.com, 24914@debbugs.gnu.org > Date: Sat, 09 Dec 2017 21:18:05 -0500 > > Unfortunately, all this discussion of int size seems to be academic. I > took another look at the code, there is another limit due to regexp > opcode format. We can raise the limit to 2^16-1 though. > [...] > Here is the updated patch: LGTM for master. Thanks for the research and for the patch. ^ permalink raw reply [flat|nested] 21+ messages in thread
* bug#24914: 24.5; isearch-regexp: wrong error message 2017-12-10 6:49 ` Eli Zaretskii @ 2018-01-27 2:05 ` Noam Postavsky 0 siblings, 0 replies; 21+ messages in thread From: Noam Postavsky @ 2018-01-27 2:05 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 24914 tags 24914 fixed close 24914 27.1 quit Eli Zaretskii <eliz@gnu.org> writes: >> From: Noam Postavsky <npostavs@users.sourceforge.net> >> Cc: drew.adams@oracle.com, 24914@debbugs.gnu.org >> Date: Sat, 09 Dec 2017 21:18:05 -0500 >> >> Unfortunately, all this discussion of int size seems to be academic. I >> took another look at the code, there is another limit due to regexp >> opcode format. We can raise the limit to 2^16-1 though. >> [...] >> Here is the updated patch: > > LGTM for master. Thanks for the research and for the patch. Pushed to master [1: 559f160616], also documented limit in emacs-26 [2: 463f96b481]. [1: 559f160616]: 2018-01-26 20:49:44 -0500 Raise limit of regexp repetition (Bug#24914) https://git.savannah.gnu.org/cgit/emacs.git/commit/?id=559f1606166822394df3988c18c0ad02984ac675 [2: 463f96b481]: 2018-01-26 19:53:09 -0500 * doc/lispref/searching.texi: Document regexp repetition limit. https://git.savannah.gnu.org/cgit/emacs.git/commit/?id=463f96b4813fb77d88a7b0fa93f94aa08d71689f ^ permalink raw reply [flat|nested] 21+ messages in thread
* bug#24914: 24.5; isearch-regexp: wrong error message 2017-12-04 6:27 ` Noam Postavsky 2017-12-04 14:52 ` Drew Adams @ 2017-12-04 15:18 ` Eli Zaretskii 1 sibling, 0 replies; 21+ messages in thread From: Eli Zaretskii @ 2017-12-04 15:18 UTC (permalink / raw) To: Noam Postavsky; +Cc: 24914 > From: Noam Postavsky <npostavs@users.sourceforge.net> > Date: Mon, 04 Dec 2017 01:27:27 -0500 > Cc: 24914@debbugs.gnu.org > > Drew Adams <drew.adams@oracle.com> writes: > > >> It's not a limit in Lisp, but in regex.c. > > > > We can't use something larger there? > > Hmm, right, actually I see in regex.h: > > /* If sizeof(int) == 2, then ((1 << 15) - 1) overflows. */ > #define RE_DUP_MAX (0x7fff) > > Does Emacs even support 16 bit platforms? Emacs never did (the MS-DOS port of Emacs runs in i386 32-bit protected mode on top of a 16-bit OS). But regex.c did, at some very distant past, to support the 16-bit MS compiler, or at least it tried to. ^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2018-01-27 2:05 UTC | newest] Thread overview: 21+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2016-11-09 22:29 bug#24914: 24.5; isearch-regexp: wrong error message Drew Adams 2017-12-03 16:37 ` Noam Postavsky 2017-12-03 18:00 ` Drew Adams 2017-12-03 18:13 ` Noam Postavsky 2017-12-03 18:56 ` Drew Adams 2017-12-04 6:27 ` Noam Postavsky 2017-12-04 14:52 ` Drew Adams 2017-12-05 1:18 ` Noam Postavsky 2017-12-05 3:15 ` Drew Adams 2017-12-05 3:51 ` Noam Postavsky 2017-12-05 4:52 ` Drew Adams 2017-12-05 13:27 ` Noam Postavsky 2017-12-05 15:31 ` Drew Adams 2017-12-06 2:52 ` Noam Postavsky 2017-12-08 9:48 ` Eli Zaretskii 2017-12-08 13:32 ` Noam Postavsky 2017-12-08 14:35 ` Eli Zaretskii 2017-12-10 2:18 ` Noam Postavsky 2017-12-10 6:49 ` Eli Zaretskii 2018-01-27 2:05 ` Noam Postavsky 2017-12-04 15:18 ` Eli Zaretskii
Code repositories for project(s) associated with this public inbox https://git.savannah.gnu.org/cgit/emacs.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).