* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]". @ 2015-12-04 4:20 Alan Mackenzie 2015-12-04 9:23 ` Eli Zaretskii [not found] ` <mailman.1363.1449242229.31583.bug-gnu-emacs@gnu.org> 0 siblings, 2 replies; 26+ messages in thread From: Alan Mackenzie @ 2015-12-04 4:20 UTC (permalink / raw) To: 22090 Hello, Emacs With a recent emacs-25 (last update eaa1fd6dbff8346eb38485de5ebf0fbfacf374d9 from Thursday 2015-12-03): emacs -Q C-c C-f src/xdisp.c Move point to L30 (paragraph beginning "Updating the display is triggered by the Lisp interpreter ...") C-s C-w repeatedly, to yank words onto the search string. After ~29 words have been yanked, the response becomes sluggish, pausing for between 0.5s and 1s before highlighting the "for" at the end of L31. Carrying on with C-w, some words are taking 2 or 3 seconds to be registered by Isearch. This is Bad. After having yanked "you as part of" from L32, (i) the " of" gets highlighted in the isearch-error face in the echo area; (ii) the text "[Too many words]" is appended to the echo area; (iii) the highlighting is removed from the match; (iv) point is placed at the start of the match (i.e. BOL 30). At this point, <BKSP> will still behave as expected, except it's action too is very sluggish - to remove two words from the current search took several seconds. Observation: it may be that C-w done in the vicinity of two or several spaces experiences extra delay. -- Alan Mackenzie (Nuremberg, Germany). ^ permalink raw reply [flat|nested] 26+ messages in thread
* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]". 2015-12-04 4:20 bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]" Alan Mackenzie @ 2015-12-04 9:23 ` Eli Zaretskii 2015-12-04 15:16 ` Artur Malabarba [not found] ` <mailman.1363.1449242229.31583.bug-gnu-emacs@gnu.org> 1 sibling, 1 reply; 26+ messages in thread From: Eli Zaretskii @ 2015-12-04 9:23 UTC (permalink / raw) To: Alan Mackenzie; +Cc: 22090 > Date: Fri, 4 Dec 2015 04:20:52 +0000 > From: Alan Mackenzie <acm@muc.de> > > With a recent emacs-25 (last update > eaa1fd6dbff8346eb38485de5ebf0fbfacf374d9 from Thursday 2015-12-03): > > emacs -Q > C-c C-f src/xdisp.c > Move point to L30 (paragraph beginning "Updating the display is triggered > by the Lisp interpreter ...") > > C-s > C-w repeatedly, to yank words onto the search string. > > After ~29 words have been yanked, the response becomes sluggish, pausing > for between 0.5s and 1s before highlighting the "for" at the end of L31. > > Carrying on with C-w, some words are taking 2 or 3 seconds to be > registered by Isearch. This is Bad. Here's a profile for this part: - command-execute 1762 99% - call-interactively 1762 99% - funcall-interactively 1762 99% - isearch-yank-word-or-char 1760 99% - isearch-yank-internal 1760 99% - isearch-yank-string 1760 99% - isearch-process-search-string 1760 99% - isearch-search-and-update 1760 99% - isearch-update 1760 99% - if 1760 99% - progn 1760 99% - while 1757 99% - let 1757 99% - isearch-lazy-highlight-search 1757 99% - condition-case 1757 99% - let 1757 99% - while 1757 99% - setq 1757 99% - isearch-search-string 1757 99% - let* 1757 99% - save-excursion 1757 99% - funcall 1757 99% - #<lambda 0x2f020ea27a40> 1757 99% - let 1757 99% - condition-case 1757 99% - funcall 1757 99% - cond 7 0% - let 7 0% - if 7 0% - funcall 7 0% character-fold-to-regexp 7 0% - isearch-lazy-highlight-new-loop 2 0% - if 2 0% - and 2 0% - sit-for 2 0% redisplay 2 0% - if 1 0% - if 1 0% - isearch-message 1 0% - let 1 0% - if 1 0% let 1 0% - isearch-forward 1 0% - isearch-mode 1 0% - isearch-update 1 0% - if 1 0% - progn 1 0% - isearch-lazy-highlight-new-loop 1 0% - if 1 0% - and 1 0% - sit-for 1 0% - redisplay 1 0% - redisplay_internal (C function) 1 0% - find-image 1 0% image-search-load-path 1 0% - execute-extended-command 1 0% - command-execute 1 0% - call-interactively 1 0% - funcall-interactively 1 0% - profiler-report 1 0% - profiler-report-cpu 1 0% profiler-cpu-profile 1 0% - ... 5 0% Automatic GC 5 0% ^ permalink raw reply [flat|nested] 26+ messages in thread
* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]". 2015-12-04 9:23 ` Eli Zaretskii @ 2015-12-04 15:16 ` Artur Malabarba 2015-12-04 15:23 ` Eli Zaretskii 2015-12-04 15:49 ` Random832 0 siblings, 2 replies; 26+ messages in thread From: Artur Malabarba @ 2015-12-04 15:16 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 22090, Alan Mackenzie 2015-12-04 9:23 GMT+00:00 Eli Zaretskii <eliz@gnu.org>: >> Date: Fri, 4 Dec 2015 04:20:52 +0000 >> From: Alan Mackenzie <acm@muc.de> >> >> With a recent emacs-25 (last update >> eaa1fd6dbff8346eb38485de5ebf0fbfacf374d9 from Thursday 2015-12-03): >> >> emacs -Q >> C-c C-f src/xdisp.c >> Move point to L30 (paragraph beginning "Updating the display is triggered >> by the Lisp interpreter ...") >> >> C-s >> C-w repeatedly, to yank words onto the search string. >> >> After ~29 words have been yanked, the response becomes sluggish, pausing >> for between 0.5s and 1s before highlighting the "for" at the end of L31. Thanks for the report. The source for this (and for a similar bug mentioned on a thread in emacs-devel) was the code I had added for special case-folding support. For now, I've just removed the code. I can think of a way of solving this, but it adds some complexity to isearch, which I don't wanna do (and I don't think this feature was that important anyway). Here's a full copy of the commit message explaining why the bug happens. 30f3432 * lisp/character-fold.el: Remove special case-folding support (character-fold-to-regexp): Remove special code for case-folding. Char-fold search still respects the `case-fold-search' variable (i.e., f matches F). This only removes the code that was added to ensure that f also matched all chars that F matched. For instance, after this commit, f no longer matches 𝔽. This was necessary because the logic created a regexp with 2^(length of the string) redundant paths. So, when a very long string "almost" matched, Emacs took a very long time to figure out that it didn't. This became particularly relevant because isearch's lazy-highlight does a search bounded by (1- match-end) (which, in most circumstances, is a search that almost matches). A recipe for this can be found in bug#22090. ^ permalink raw reply [flat|nested] 26+ messages in thread
* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]". 2015-12-04 15:16 ` Artur Malabarba @ 2015-12-04 15:23 ` Eli Zaretskii 2015-12-04 16:06 ` Artur Malabarba 2015-12-04 15:49 ` Random832 1 sibling, 1 reply; 26+ messages in thread From: Eli Zaretskii @ 2015-12-04 15:23 UTC (permalink / raw) To: bruce.connor.am; +Cc: 22090, acm > Date: Fri, 4 Dec 2015 15:16:23 +0000 > From: Artur Malabarba <bruce.connor.am@gmail.com> > Cc: Alan Mackenzie <acm@muc.de>, 22090@debbugs.gnu.org > > 30f3432 * lisp/character-fold.el: Remove special case-folding support > > (character-fold-to-regexp): Remove special code for > case-folding. Char-fold search still respects the > `case-fold-search' variable (i.e., f matches F). This only > removes the code that was added to ensure that f also matched > all chars that F matched. For instance, after this commit, f > no longer matches 𝔽. Thanks. Is there any reasonably simple way of describing the resulting limitations (i.e. what will NOT match) on the user manual level? ^ permalink raw reply [flat|nested] 26+ messages in thread
* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]". 2015-12-04 15:23 ` Eli Zaretskii @ 2015-12-04 16:06 ` Artur Malabarba 2015-12-04 16:27 ` Eli Zaretskii 0 siblings, 1 reply; 26+ messages in thread From: Artur Malabarba @ 2015-12-04 16:06 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 22090, Alan Mackenzie 2015-12-04 15:23 GMT+00:00 Eli Zaretskii <eliz@gnu.org>: >> Date: Fri, 4 Dec 2015 15:16:23 +0000 >> From: Artur Malabarba <bruce.connor.am@gmail.com> >> Cc: Alan Mackenzie <acm@muc.de>, 22090@debbugs.gnu.org >> >> 30f3432 * lisp/character-fold.el: Remove special case-folding support >> >> (character-fold-to-regexp): Remove special code for >> case-folding. Char-fold search still respects the >> `case-fold-search' variable (i.e., f matches F). This only >> removes the code that was added to ensure that f also matched >> all chars that F matched. For instance, after this commit, f >> no longer matches 𝔽. > > Thanks. Is there any reasonably simple way of describing the > resulting limitations (i.e. what will NOT match) on the user manual > level? Basically, 'a' will match similar characters (like '𝑎' and 'á') and their upper-case equivalents (like 'Á'). 'a' will NOT match characters similar to 'A' that don't have a lower-case equivalent (like '𝔸') in the unicode standard. ^ permalink raw reply [flat|nested] 26+ messages in thread
* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]". 2015-12-04 16:06 ` Artur Malabarba @ 2015-12-04 16:27 ` Eli Zaretskii 2015-12-04 16:37 ` Artur Malabarba 0 siblings, 1 reply; 26+ messages in thread From: Eli Zaretskii @ 2015-12-04 16:27 UTC (permalink / raw) To: bruce.connor.am; +Cc: 22090, acm > Date: Fri, 4 Dec 2015 16:06:40 +0000 > From: Artur Malabarba <bruce.connor.am@gmail.com> > Cc: Alan Mackenzie <acm@muc.de>, 22090@debbugs.gnu.org > > > Thanks. Is there any reasonably simple way of describing the > > resulting limitations (i.e. what will NOT match) on the user manual > > level? > > Basically, 'a' will match similar characters (like '𝑎' and 'á') and > their upper-case equivalents (like 'Á'). 'a' will NOT match characters > similar to 'A' that don't have a lower-case equivalent (like '𝔸') in > the unicode standard. What about ligatures, or symbols like ℻? Also, by "lower-case equivalent" do you mean a case mapping defined by the UCD, or just by visual appearance? An example of the latter would be Ⓐ vs ⓐ. ^ permalink raw reply [flat|nested] 26+ messages in thread
* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]". 2015-12-04 16:27 ` Eli Zaretskii @ 2015-12-04 16:37 ` Artur Malabarba 2015-12-04 18:48 ` Eli Zaretskii 0 siblings, 1 reply; 26+ messages in thread From: Artur Malabarba @ 2015-12-04 16:37 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 22090, Alan Mackenzie [-- Attachment #1: Type: text/plain, Size: 1087 bytes --] On 4 Dec 2015 4:27 pm, "Eli Zaretskii" <eliz@gnu.org> wrote: > > > Date: Fri, 4 Dec 2015 16:06:40 +0000 > > From: Artur Malabarba <bruce.connor.am@gmail.com> > > Cc: Alan Mackenzie <acm@muc.de>, 22090@debbugs.gnu.org > > > > > Thanks. Is there any reasonably simple way of describing the > > > resulting limitations (i.e. what will NOT match) on the user manual > > > level? > > > > Basically, 'a' will match similar characters (like '𝑎' and 'á') and > > their upper-case equivalents (like 'Á'). 'a' will NOT match characters > > similar to 'A' that don't have a lower-case equivalent (like '𝔸') in > > the unicode standard. > > What about ligatures, or symbols like ℻? Won't match cross-case. > Also, by "lower-case equivalent" do you mean a case mapping defined by > the UCD Yes. Visual appearance is irrelevant. Strictly speaking, to match a general character, you need to search for its decomposition. If this character also has a case "equivalent" (as per current-case-table), you can also search for the decomposition of this equivalent. [-- Attachment #2: Type: text/html, Size: 1599 bytes --] ^ permalink raw reply [flat|nested] 26+ messages in thread
* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]". 2015-12-04 16:37 ` Artur Malabarba @ 2015-12-04 18:48 ` Eli Zaretskii 2015-12-04 19:59 ` Artur Malabarba 0 siblings, 1 reply; 26+ messages in thread From: Eli Zaretskii @ 2015-12-04 18:48 UTC (permalink / raw) To: bruce.connor.am; +Cc: 22090, acm > Date: Fri, 4 Dec 2015 16:37:08 +0000 > From: Artur Malabarba <bruce.connor.am@gmail.com> > Cc: 22090@debbugs.gnu.org, Alan Mackenzie <acm@muc.de> > > > What about ligatures, or symbols like ℻? > > Won't match cross-case. So f will match ffi but not ℻, and F will match ℻ but not ffi, is that right? > > Also, by "lower-case equivalent" do you mean a case mapping defined by > > the UCD > > Yes. Visual appearance is irrelevant. > > Strictly speaking, to match a general character, you need to search for its > decomposition. If this character also has a case "equivalent" (as per > current-case-table), you can also search for the decomposition of this > equivalent. OK, thanks. I will try to think if this needs to be explained in the manual. ^ permalink raw reply [flat|nested] 26+ messages in thread
* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]". 2015-12-04 18:48 ` Eli Zaretskii @ 2015-12-04 19:59 ` Artur Malabarba 2015-12-05 9:19 ` Eli Zaretskii 0 siblings, 1 reply; 26+ messages in thread From: Artur Malabarba @ 2015-12-04 19:59 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 22090, Alan Mackenzie 2015-12-04 18:48 GMT+00:00 Eli Zaretskii <eliz@gnu.org>: >> > What about ligatures, or symbols like ℻? >> >> Won't match cross-case. > > So f will match ffi but not ℻, and F will match ℻ but not ffi, is that > right? No. ffi will match ffi, but FFI won't. And FAX will match ℻, but fax won't. f shouldn't match ffi anymore. ^ permalink raw reply [flat|nested] 26+ messages in thread
* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]". 2015-12-04 19:59 ` Artur Malabarba @ 2015-12-05 9:19 ` Eli Zaretskii 0 siblings, 0 replies; 26+ messages in thread From: Eli Zaretskii @ 2015-12-05 9:19 UTC (permalink / raw) To: bruce.connor.am; +Cc: 22090, acm > Date: Fri, 4 Dec 2015 19:59:59 +0000 > From: Artur Malabarba <bruce.connor.am@gmail.com> > Cc: 22090@debbugs.gnu.org, Alan Mackenzie <acm@muc.de> > > 2015-12-04 18:48 GMT+00:00 Eli Zaretskii <eliz@gnu.org>: > >> > What about ligatures, or symbols like ℻? > >> > >> Won't match cross-case. > > > > So f will match ffi but not ℻, and F will match ℻ but not ffi, is that > > right? > > No. ffi will match ffi, but FFI won't. And FAX will match ℻, but fax won't. > f shouldn't match ffi anymore. OK, thanks. It seems like the text in the manual already complies with the above. ^ permalink raw reply [flat|nested] 26+ messages in thread
* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]". 2015-12-04 15:16 ` Artur Malabarba 2015-12-04 15:23 ` Eli Zaretskii @ 2015-12-04 15:49 ` Random832 2015-12-04 16:21 ` Artur Malabarba 1 sibling, 1 reply; 26+ messages in thread From: Random832 @ 2015-12-04 15:49 UTC (permalink / raw) To: 22090 On 2015-12-04, Artur Malabarba <bruce.connor.am@gmail.com> wrote: > This was necessary because the logic created a regexp with > 2^(length of the string) redundant paths. So, when a very > long string "almost" matched, Emacs took a very long time to > figure out that it didn't. This became particularly relevant > because isearch's lazy-highlight does a search bounded by (1- > match-end) (which, in most circumstances, is a search that > almost matches). A recipe for this can be found in bug#22090. So has any thought been given to implementing folding searches via matching a simple regexp against a projected version of the buffer rather than the current mechanism of creating a regexp that will always match when it should? ^ permalink raw reply [flat|nested] 26+ messages in thread
* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]". 2015-12-04 15:49 ` Random832 @ 2015-12-04 16:21 ` Artur Malabarba 2015-12-04 16:37 ` Random832 0 siblings, 1 reply; 26+ messages in thread From: Artur Malabarba @ 2015-12-04 16:21 UTC (permalink / raw) To: Random832; +Cc: 22090 [-- Attachment #1: Type: text/plain, Size: 483 bytes --] On 4 Dec 2015 3:49 pm, "Random832" <random832@fastmail.com> wrote: > So has any thought been given to implementing folding searches > via matching a simple regexp against a projected version of the > buffer rather than the current mechanism of creating a regexp > that will always match when it should? There were suggestions of projecting both the buffer and the search string (which is what case folding does) but nobody has offered to do it. What do you mean by "simple regexp"? [-- Attachment #2: Type: text/html, Size: 620 bytes --] ^ permalink raw reply [flat|nested] 26+ messages in thread
* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]". 2015-12-04 16:21 ` Artur Malabarba @ 2015-12-04 16:37 ` Random832 2015-12-04 16:51 ` Artur Malabarba 2015-12-04 18:24 ` Eli Zaretskii 0 siblings, 2 replies; 26+ messages in thread From: Random832 @ 2015-12-04 16:37 UTC (permalink / raw) To: 22090 On 2015-12-04, Artur Malabarba <bruce.connor.am@gmail.com> wrote: > There were suggestions of projecting both the buffer and the search string > (which is what case folding does) but nobody has offered to do it. > What do you mean by "simple regexp"? As in none of the huge character classes for folding, just the characters the user types (normalized to all-lowercase or all-uppercase for case-folding searches) but maybe e.g. "\s+" for lax whitespace. ^ permalink raw reply [flat|nested] 26+ messages in thread
* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]". 2015-12-04 16:37 ` Random832 @ 2015-12-04 16:51 ` Artur Malabarba 2015-12-04 18:24 ` Eli Zaretskii 1 sibling, 0 replies; 26+ messages in thread From: Artur Malabarba @ 2015-12-04 16:51 UTC (permalink / raw) To: Random832; +Cc: 22090 [-- Attachment #1: Type: text/plain, Size: 496 bytes --] On 4 Dec 2015 4:37 pm, "Random832" <random832@fastmail.com> wrote: > > On 2015-12-04, Artur Malabarba <bruce.connor.am@gmail.com> wrote: > > There were suggestions of projecting both the buffer and the search string > > (which is what case folding does) but nobody has offered to do it. > > What do you mean by "simple regexp"? > > As in... I see. Then the answer is the same. Nobody has offered to write the C code necessary to compare only a "projection" of the buffer with the search string. [-- Attachment #2: Type: text/html, Size: 722 bytes --] ^ permalink raw reply [flat|nested] 26+ messages in thread
* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]". 2015-12-04 16:37 ` Random832 2015-12-04 16:51 ` Artur Malabarba @ 2015-12-04 18:24 ` Eli Zaretskii 1 sibling, 0 replies; 26+ messages in thread From: Eli Zaretskii @ 2015-12-04 18:24 UTC (permalink / raw) To: Random832; +Cc: 22090 > From: Random832 <random832@fastmail.com> > Date: Fri, 4 Dec 2015 16:37:59 +0000 (UTC) > > On 2015-12-04, Artur Malabarba <bruce.connor.am@gmail.com> wrote: > > There were suggestions of projecting both the buffer and the search string > > (which is what case folding does) but nobody has offered to do it. > > What do you mean by "simple regexp"? > > As in none of the huge character classes for folding, just the > characters the user types (normalized to all-lowercase or > all-uppercase for case-folding searches) but maybe e.g. "\s+" > for lax whitespace. You need to normalize the buffer text as well, so this must be done on the C level. ^ permalink raw reply [flat|nested] 26+ messages in thread
[parent not found: <mailman.1363.1449242229.31583.bug-gnu-emacs@gnu.org>]
* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]". [not found] ` <mailman.1363.1449242229.31583.bug-gnu-emacs@gnu.org> @ 2015-12-04 17:01 ` Alan Mackenzie 2015-12-04 19:21 ` Alan Mackenzie 1 sibling, 0 replies; 26+ messages in thread From: Alan Mackenzie @ 2015-12-04 17:01 UTC (permalink / raw) To: 22090-done; +Cc: bruce.connor.am Hello, Artur. In article <mailman.1363.1449242229.31583.bug-gnu-emacs@gnu.org> you wrote: > 2015-12-04 9:23 GMT+00:00 Eli Zaretskii <eliz@gnu.org>: >>> Date: Fri, 4 Dec 2015 04:20:52 +0000 >>> From: Alan Mackenzie <acm@muc.de> >>> >>> With a recent emacs-25 (last update >>> eaa1fd6dbff8346eb38485de5ebf0fbfacf374d9 from Thursday 2015-12-03): >>> >>> emacs -Q >>> C-c C-f src/xdisp.c >>> Move point to L30 (paragraph beginning "Updating the display is triggered >>> by the Lisp interpreter ...") >>> >>> C-s >>> C-w repeatedly, to yank words onto the search string. >>> >>> After ~29 words have been yanked, the response becomes sluggish, pausing >>> for between 0.5s and 1s before highlighting the "for" at the end of L31. > Thanks for the report. The source for this (and for a similar bug > mentioned on a thread in emacs-devel) was the code I had added for > special case-folding support. > For now, I've just removed the code. I can think of a way of solving > this, but it adds some complexity to isearch, which I don't wanna do > (and I don't think this feature was that important anyway). Here's a > full copy of the commit message explaining why the bug happens. Thanks for reacting to this so quickly. I confirm that both symptoms of the bug have been resolved. So I'm closing this bug. [ .... ] -- Alan Mackenzie (Nuremberg, Germany). ^ permalink raw reply [flat|nested] 26+ messages in thread
* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]". [not found] ` <mailman.1363.1449242229.31583.bug-gnu-emacs@gnu.org> 2015-12-04 17:01 ` Alan Mackenzie @ 2015-12-04 19:21 ` Alan Mackenzie 2015-12-04 20:08 ` Eli Zaretskii 2015-12-04 20:49 ` Artur Malabarba 1 sibling, 2 replies; 26+ messages in thread From: Alan Mackenzie @ 2015-12-04 19:21 UTC (permalink / raw) To: bruce.connor.am; +Cc: 22090 Hello, Artur. In article <mailman.1363.1449242229.31583.bug-gnu-emacs@gnu.org> you wrote: > 2015-12-04 9:23 GMT+00:00 Eli Zaretskii <eliz@gnu.org>: >>> Date: Fri, 4 Dec 2015 04:20:52 +0000 >>> From: Alan Mackenzie <acm@muc.de> >>> >>> With a recent emacs-25 (last update >>> eaa1fd6dbff8346eb38485de5ebf0fbfacf374d9 from Thursday 2015-12-03): >>> >>> emacs -Q >>> C-c C-f src/xdisp.c >>> Move point to L30 (paragraph beginning "Updating the display is triggered >>> by the Lisp interpreter ...") >>> >>> C-s >>> C-w repeatedly, to yank words onto the search string. >>> >>> After ~29 words have been yanked, the response becomes sluggish, pausing >>> for between 0.5s and 1s before highlighting the "for" at the end of L31. > Thanks for the report. The source for this (and for a similar bug > mentioned on a thread in emacs-devel) was the code I had added for > special case-folding support. > For now, I've just removed the code. I can think of a way of solving > this, but it adds some complexity to isearch, which I don't wanna do > (and I don't think this feature was that important anyway). Here's a > full copy of the commit message explaining why the bug happens. > 30f3432 * lisp/character-fold.el: Remove special case-folding support > (character-fold-to-regexp): Remove special code for > case-folding. Char-fold search still respects the > `case-fold-search' variable (i.e., f matches F). This only > removes the code that was added to ensure that f also matched > all chars that F matched. For instance, after this commit, f > no longer matches 𝔽. > This was necessary because the logic created a regexp with > 2^(length of the string) redundant paths. So, when a very > long string "almost" matched, Emacs took a very long time to > figure out that it didn't. This became particularly relevant > because isearch's lazy-highlight does a search bounded by (1- > match-end) (which, in most circumstances, is a search that > almost matches). A recipe for this can be found in bug#22090. Would you like any help to sort out these regexps? I have some expertise in doing this, having half-written fix-re.el, a program which analyses and corrects just the sort of thing you're talking about. -- Alan Mackenzie (Nuremberg, Germany). ^ permalink raw reply [flat|nested] 26+ messages in thread
* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]". 2015-12-04 19:21 ` Alan Mackenzie @ 2015-12-04 20:08 ` Eli Zaretskii 2015-12-04 20:49 ` Artur Malabarba 1 sibling, 0 replies; 26+ messages in thread From: Eli Zaretskii @ 2015-12-04 20:08 UTC (permalink / raw) To: Alan Mackenzie; +Cc: 22090, bruce.connor.am > Date: 4 Dec 2015 19:21:26 -0000 > From: Alan Mackenzie <acm@muc.de> > Cc: 22090@debbugs.gnu.org > > > (character-fold-to-regexp): Remove special code for > > case-folding. Char-fold search still respects the > > `case-fold-search' variable (i.e., f matches F). This only > > removes the code that was added to ensure that f also matched > > all chars that F matched. For instance, after this commit, f > > no longer matches 𝔽. > > > This was necessary because the logic created a regexp with > > 2^(length of the string) redundant paths. So, when a very > > long string "almost" matched, Emacs took a very long time to > > figure out that it didn't. This became particularly relevant > > because isearch's lazy-highlight does a search bounded by (1- > > match-end) (which, in most circumstances, is a search that > > almost matches). A recipe for this can be found in bug#22090. > > Would you like any help to sort out these regexps? I'm not sure the use cases related to case folding should be working in principle. That's because normalization under case folding means first downcase, then decompose; it is not allowed to downcase the decomposition. So if this issue is only about those, I don't think there's anything to sort out here, thanks. ^ permalink raw reply [flat|nested] 26+ messages in thread
* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]". 2015-12-04 19:21 ` Alan Mackenzie 2015-12-04 20:08 ` Eli Zaretskii @ 2015-12-04 20:49 ` Artur Malabarba 2015-12-04 23:00 ` Alan Mackenzie 1 sibling, 1 reply; 26+ messages in thread From: Artur Malabarba @ 2015-12-04 20:49 UTC (permalink / raw) To: Alan Mackenzie; +Cc: 22090 2015-12-04 19:21 GMT+00:00 Alan Mackenzie <acm@muc.de>: > Would you like any help to sort out these regexps? I have some expertise > in doing this, having half-written fix-re.el, a program which analyses > and corrects just the sort of thing you're talking about. Maybe you can help then. The situation is actually quite simple. We have a regexp for matching anything that 'a' should match (for instance, that might look like "\\(a[´`]?\\|[áà𝑎]\\)"), and we have another for matching anything that A could match (e.g. "\\(A[`´]?\\|[ÁÀ]\\)"). When case-fold-search is on the previous code would simply join these regexps with "\\(\\(a[´`]?\\|[áà𝑎]\\)\\|\\(A[`´]?\\|[ÁÀ]\\)\\)". The problem is that (when case-fold-search is on) this creates a lot of redundancy. There are two paths in that regexp that match "a", there are two paths that match "à" and so on (but it's not full redundancy, for instance, only one path matches 𝑎). ^ permalink raw reply [flat|nested] 26+ messages in thread
* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]". 2015-12-04 20:49 ` Artur Malabarba @ 2015-12-04 23:00 ` Alan Mackenzie 2015-12-05 17:23 ` Artur Malabarba 0 siblings, 1 reply; 26+ messages in thread From: Alan Mackenzie @ 2015-12-04 23:00 UTC (permalink / raw) To: Artur Malabarba; +Cc: 22090 Hello, Artur. On Fri, Dec 04, 2015 at 08:49:42PM +0000, Artur Malabarba wrote: > 2015-12-04 19:21 GMT+00:00 Alan Mackenzie <acm@muc.de>: > > Would you like any help to sort out these regexps? I have some expertise > > in doing this, having half-written fix-re.el, a program which analyses > > and corrects just the sort of thing you're talking about. > Maybe you can help then. The situation is actually quite simple. > We have a regexp for matching anything that 'a' should match (for > instance, that might look like "\\(a[´`]?\\|[áà𝑎]\\)"), and we have > another for matching anything that A could match (e.g. > "\\(A[`´]?\\|[ÁÀ]\\)"). Each of these regexps looks intrinsically blameless.. > When case-fold-search is on the previous code would simply join these > regexps with "\\(\\(a[´`]?\\|[áà𝑎]\\)\\|\\(A[`´]?\\|[ÁÀ]\\)\\)". Quick question: _why_ do you need to join them? Given that case-fold-search is enabled, couldn't you just use, say, the lower case version? > The problem is that (when case-fold-search is on) this creates a lot > of redundancy. There are two paths in that regexp that match "a", > there are two paths that match "à" and so on (but it's not full > redundancy, for instance, only one path matches 𝑎). Yes. This is the killer danger in regexps (at least with the sort of regexp engine we've got). But it looks to me that this redundancy would be quite easy to eliminate - you just need three regexp fragments for the letter "a" - a lower case one, an upper case one and a case-fold-search one. The other thing is that for that single character "a" a 39 character regexp fragment is being generated. Might this have something to do with the "[Too many words]" error I got last night (which comes from the regexp engine returning a "too long regexp" error)? Even if you can reduce that to, say 19 characters, that's only winning a factor of 2 in the slide towards a too long regexp. It might well be that for a very long regexp, you might have to divide it into shorter sections (a typical long RE will by a sequence of sub expressions, rather than lots of alternatives inside \(...\|........\)). -- Alan Mackenzie (Nuremberg, Germany). ^ permalink raw reply [flat|nested] 26+ messages in thread
* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]". 2015-12-04 23:00 ` Alan Mackenzie @ 2015-12-05 17:23 ` Artur Malabarba 2015-12-05 17:32 ` Eli Zaretskii 2015-12-05 18:52 ` Alan Mackenzie 0 siblings, 2 replies; 26+ messages in thread From: Artur Malabarba @ 2015-12-05 17:23 UTC (permalink / raw) To: Alan Mackenzie; +Cc: 22090 nn2015-12-04 23:00 GMT+00:00 Alan Mackenzie <acm@muc.de>: >> When case-fold-search is on the previous code would simply join these >> regexps with "\\(\\(a[´`]?\\|[áà𝑎]\\)\\|\\(A[`´]?\\|[ÁÀ]\\)\\)". > > Quick question: _why_ do you need to join them? Given that > case-fold-search is enabled, couldn't you just use, say, the lower case > version? Because there are some characters in each regexp that don't have lower/upper-case equivalents. For instance, if I use the "\\(\\(a[´`]?\\|[áà𝑎]\\)" regexp, that's enough to match A or À, but it's not enough to match a variety of other chars (𝔸𝕬𝖠𝗔𝘈𝘼𝙰🄰). > it looks to me that this redundancy would > be quite easy to eliminate - you just need three regexp fragments for > the letter "a" - a lower case one, an upper case one and a > case-fold-search one. Yes, we could go that route. It's just going to add complexity to the code that generates the char-fold-table (which is already quite dense) and I wonder if it's worth such a corner-case. Like I said, 'a' already matches A and À, how much do we want to support this extra case-folding? > The other thing is that for that single character "a" a 39 character > regexp fragment is being generated. Might this have something to do > with the "[Too many words]" error I got last night (which comes from the > regexp engine returning a "too long regexp" error)? yes > Even if you can reduce that to, say 19 characters, that's only winning a > factor of 2 in the slide towards a too long regexp. It might well be > that for a very long regexp, you might have to divide it into shorter > sections (a typical long RE will by a sequence of sub expressions, > rather than lots of alternatives inside \(...\|........\)). I don't understand what you mean. Could you elaborate? ^ permalink raw reply [flat|nested] 26+ messages in thread
* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]". 2015-12-05 17:23 ` Artur Malabarba @ 2015-12-05 17:32 ` Eli Zaretskii 2015-12-05 18:12 ` Artur Malabarba 2015-12-05 18:52 ` Alan Mackenzie 1 sibling, 1 reply; 26+ messages in thread From: Eli Zaretskii @ 2015-12-05 17:32 UTC (permalink / raw) To: bruce.connor.am; +Cc: 22090, acm > Date: Sat, 5 Dec 2015 17:23:53 +0000 > From: Artur Malabarba <bruce.connor.am@gmail.com> > Cc: 22090@debbugs.gnu.org > > nn2015-12-04 23:00 GMT+00:00 Alan Mackenzie <acm@muc.de>: > >> When case-fold-search is on the previous code would simply join these > >> regexps with "\\(\\(a[´`]?\\|[áà𝑎]\\)\\|\\(A[`´]?\\|[ÁÀ]\\)\\)". > > > > Quick question: _why_ do you need to join them? Given that > > case-fold-search is enabled, couldn't you just use, say, the lower case > > version? > > Because there are some characters in each regexp that don't have > lower/upper-case equivalents. For instance, if I use the > "\\(\\(a[´`]?\\|[áà𝑎]\\)" regexp, that's enough to match A or À, but > it's not enough to match a variety of other chars (𝔸𝕬𝖠𝗔𝘈𝘼𝙰🄰). You don't need to match the latter set. Character folding is applied _after_ case folding, not before. So characters that don't have a lower-case variant simply shouldn't match a lower-case a -- and they won't, if you just let case-insensitive regexp matching do its job. ^ permalink raw reply [flat|nested] 26+ messages in thread
* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]". 2015-12-05 17:32 ` Eli Zaretskii @ 2015-12-05 18:12 ` Artur Malabarba 2015-12-05 18:34 ` Eli Zaretskii 0 siblings, 1 reply; 26+ messages in thread From: Artur Malabarba @ 2015-12-05 18:12 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 22090, Alan Mackenzie 2015-12-05 17:32 GMT+00:00 Eli Zaretskii <eliz@gnu.org>: >> Because there are some characters in each regexp that don't have >> lower/upper-case equivalents. For instance, if I use the >> "\\(\\(a[´`]?\\|[áà𝑎]\\)" regexp, that's enough to match A or À, but >> it's not enough to match a variety of other chars (𝔸𝕬𝖠𝗔𝘈𝘼𝙰🄰). > > You don't need to match the latter set. Character folding is applied > _after_ case folding, not before. So characters that don't have a > lower-case variant simply shouldn't match a lower-case a -- and they > won't, if you just let case-insensitive regexp matching do its job. Given that char-folding is a new feature, how it combines with case-folding is entirely up to us, and I have really no idea what would be TRT. However, if that is your opinion, I'm more than happy to accept that the current situation ('a' doesn't match '𝔸𝕬𝖠𝗔𝘈𝘼𝙰🄰') is TRT, given that it has the simplest implementation. :-) ^ permalink raw reply [flat|nested] 26+ messages in thread
* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]". 2015-12-05 18:12 ` Artur Malabarba @ 2015-12-05 18:34 ` Eli Zaretskii 0 siblings, 0 replies; 26+ messages in thread From: Eli Zaretskii @ 2015-12-05 18:34 UTC (permalink / raw) To: bruce.connor.am; +Cc: 22090, acm > Date: Sat, 5 Dec 2015 18:12:52 +0000 > From: Artur Malabarba <bruce.connor.am@gmail.com> > Cc: Alan Mackenzie <acm@muc.de>, 22090@debbugs.gnu.org > > 2015-12-05 17:32 GMT+00:00 Eli Zaretskii <eliz@gnu.org>: > >> Because there are some characters in each regexp that don't have > >> lower/upper-case equivalents. For instance, if I use the > >> "\\(\\(a[´`]?\\|[áà𝑎]\\)" regexp, that's enough to match A or À, but > >> it's not enough to match a variety of other chars (𝔸𝕬𝖠𝗔𝘈𝘼𝙰🄰). > > > > You don't need to match the latter set. Character folding is applied > > _after_ case folding, not before. So characters that don't have a > > lower-case variant simply shouldn't match a lower-case a -- and they > > won't, if you just let case-insensitive regexp matching do its job. > > Given that char-folding is a new feature, how it combines with > case-folding is entirely up to us, and I have really no idea what > would be TRT. I don't think there's any reasonable alternative, because for characters that have a decomposition, you wouldn't downcase the result of the decomposition, would you? The Unicode Standard also says this much (p.158): In principle, normalization needs to be done after case folding, because case folding does not preserve the normalized form of strings in all instances. (There are a couple of examples there showing why the reverse order could cause incorrect results.) So if this is true for normalization, it should also be true for the case in point. > However, if that is your opinion, I'm more than happy to accept that > the current situation ('a' doesn't match '𝔸𝕬𝖠𝗔𝘈𝘼𝙰🄰') is TRT, > given that it has the simplest implementation. :-) Yes, I think it's TRT. ^ permalink raw reply [flat|nested] 26+ messages in thread
* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]". 2015-12-05 17:23 ` Artur Malabarba 2015-12-05 17:32 ` Eli Zaretskii @ 2015-12-05 18:52 ` Alan Mackenzie 2015-12-06 12:50 ` Artur Malabarba 1 sibling, 1 reply; 26+ messages in thread From: Alan Mackenzie @ 2015-12-05 18:52 UTC (permalink / raw) To: Artur Malabarba; +Cc: 22090 Hello, Artur. On Sat, Dec 05, 2015 at 05:23:53PM +0000, Artur Malabarba wrote: > nn2015-12-04 23:00 GMT+00:00 Alan Mackenzie <acm@muc.de>: > >> When case-fold-search is on the previous code would simply join these > >> regexps with "\\(\\(a[´`]?\\|[áà𝑎]\\)\\|\\(A[`´]?\\|[ÁÀ]\\)\\)". > > Quick question: _why_ do you need to join them? Given that > > case-fold-search is enabled, couldn't you just use, say, the lower case > > version? > Because there are some characters in each regexp that don't have > lower/upper-case equivalents. For instance, if I use the > "\\(\\(a[´`]?\\|[áà𝑎]\\)" regexp, that's enough to match A or À, but > it's not enough to match a variety of other chars (𝔸𝕬𝖠𝗔𝘈𝘼𝙰🄰). OK, thanks. > > it looks to me that this redundancy would > > be quite easy to eliminate - you just need three regexp fragments for > > the letter "a" - a lower case one, an upper case one and a > > case-fold-search one. > Yes, we could go that route. It's just going to add complexity to the > code that generates the char-fold-table (which is already quite dense) > and I wonder if it's worth such a corner-case. Like I said, 'a' > already matches A and À, how much do we want to support this extra > case-folding? But it seems the complexity (and it can't honestly be that much, surely?) is intrinsic to the task being carried out. Sticking a "\\|" between the upper case and lower case versions clearly doesn't work. Seriously, how difficult can it be to generate "\\([Aa][´`]?\\|[áà𝑎ÁÀ]\\)" , which is a blameless regexp, given where you've already got to? > > The other thing is that for that single character "a" a 39 character > > regexp fragment is being generated. Might this have something to do > > with the "[Too many words]" error I got last night (which comes from the > > regexp engine returning a "too long regexp" error)? > yes I was afraid of that. > > Even if you can reduce that to, say 19 characters, that's only winning a > > factor of 2 in the slide towards a too long regexp. It might well be > > that for a very long regexp, you might have to divide it into shorter > > sections (a typical long RE will by a sequence of sub expressions, > > rather than lots of alternatives inside \(...\|........\)). > I don't understand what you mean. Could you elaborate? Once you've generated the long regexp, if it's too long, you can split it up into, say, 3 pieces A, B, C, such that (equal re (concat A B C)). Then you can do something like: (and (search-forward-regexp A bound noerror) (search-forward-regexp (concat "\\=" B) bound noerror) (search-forward-regexp (concat "\\=" C) bound noerror)) . Though, thinking about it, it might be less painful to enhance the regexp engine to take longer regexps. -- Alan Mackenzie (Nuremberg, Germany). ^ permalink raw reply [flat|nested] 26+ messages in thread
* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]". 2015-12-05 18:52 ` Alan Mackenzie @ 2015-12-06 12:50 ` Artur Malabarba 0 siblings, 0 replies; 26+ messages in thread From: Artur Malabarba @ 2015-12-06 12:50 UTC (permalink / raw) To: Alan Mackenzie; +Cc: 22090 2015-12-05 18:52 GMT+00:00 Alan Mackenzie <acm@muc.de>: > But it seems the complexity (and it can't honestly be that much, > surely?) is intrinsic to the task being carried out. Sticking a "\\|" > between the upper case and lower case versions clearly doesn't work. > > Seriously, how difficult can it be to generate > > "\\([Aa][´`]?\\|[áà𝑎ÁÀ]\\)" > > , which is a blameless regexp, given where you've already got to? Oh. I see. I thought you were talking about mutually exclusive regexps. Indeed a regexp like that would be trivial to generate. But is it really blameless? I mean, if "\\(A\\|a\\)" can lead to extremely slow searches, doesn't the same happen with "[Aa]"? Anyway, at this point I'm just asking for future knowledge/reference. According to Eli, the current implementation is in accordance with the Unicode Standard. So it's probably best to keep it this way at least for the first release of the feature. > Once you've generated the long regexp, if it's too long, you can split > it up into, say, 3 pieces A, B, C, such that (equal re (concat A B C)). > > Then you can do something like: > > (and (search-forward-regexp A bound noerror) > (search-forward-regexp (concat "\\=" B) bound noerror) > (search-forward-regexp (concat "\\=" C) bound noerror)) > > . Though, thinking about it, it might be less painful to enhance the > regexp engine to take longer regexps. Besides. Char-folding is supposed to turn strings into regexps usable anywhere, and this wouldn't work with that. I've added a clause to the function so that it won't do any charfolding if the resulting regexp would be longer than 5k chars (instead it will just regexp-quote). That will at least prevent the too-many words error in isearch. (I already had this clause in there before, but it was using 10k, which apparently is not enough). ^ permalink raw reply [flat|nested] 26+ messages in thread
end of thread, other threads:[~2015-12-06 12:50 UTC | newest] Thread overview: 26+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2015-12-04 4:20 bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]" Alan Mackenzie 2015-12-04 9:23 ` Eli Zaretskii 2015-12-04 15:16 ` Artur Malabarba 2015-12-04 15:23 ` Eli Zaretskii 2015-12-04 16:06 ` Artur Malabarba 2015-12-04 16:27 ` Eli Zaretskii 2015-12-04 16:37 ` Artur Malabarba 2015-12-04 18:48 ` Eli Zaretskii 2015-12-04 19:59 ` Artur Malabarba 2015-12-05 9:19 ` Eli Zaretskii 2015-12-04 15:49 ` Random832 2015-12-04 16:21 ` Artur Malabarba 2015-12-04 16:37 ` Random832 2015-12-04 16:51 ` Artur Malabarba 2015-12-04 18:24 ` Eli Zaretskii [not found] ` <mailman.1363.1449242229.31583.bug-gnu-emacs@gnu.org> 2015-12-04 17:01 ` Alan Mackenzie 2015-12-04 19:21 ` Alan Mackenzie 2015-12-04 20:08 ` Eli Zaretskii 2015-12-04 20:49 ` Artur Malabarba 2015-12-04 23:00 ` Alan Mackenzie 2015-12-05 17:23 ` Artur Malabarba 2015-12-05 17:32 ` Eli Zaretskii 2015-12-05 18:12 ` Artur Malabarba 2015-12-05 18:34 ` Eli Zaretskii 2015-12-05 18:52 ` Alan Mackenzie 2015-12-06 12:50 ` Artur Malabarba
Code repositories for project(s) associated with this public inbox https://git.savannah.gnu.org/cgit/emacs.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).