bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]".

unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed

* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]".
@ 2015-12-04  4:20 Alan Mackenzie
  2015-12-04  9:23 ` Eli Zaretskii
       [not found] ` <mailman.1363.1449242229.31583.bug-gnu-emacs@gnu.org>
  0 siblings, 2 replies; 26+ messages in thread
From: Alan Mackenzie @ 2015-12-04  4:20 UTC (permalink / raw)
  To: 22090

Hello, Emacs

With a recent emacs-25 (last update
eaa1fd6dbff8346eb38485de5ebf0fbfacf374d9 from Thursday 2015-12-03):

emacs -Q
C-c C-f src/xdisp.c
Move point to L30 (paragraph beginning "Updating the display is triggered
  by the Lisp interpreter ...")

C-s
C-w repeatedly, to yank words onto the search string.

After ~29 words have been yanked, the response becomes sluggish, pausing
for between 0.5s and 1s before highlighting the "for" at the end of L31.

Carrying on with C-w, some words are taking 2 or 3 seconds to be
registered by Isearch.  This is Bad.

After having yanked "you as part of" from L32,
(i) the " of" gets highlighted in the isearch-error face in the echo
  area;
(ii) the text "[Too many words]" is appended to the echo area;
(iii) the highlighting is removed from the match;
(iv) point is placed at the start of the match (i.e. BOL 30).

At this point, <BKSP> will still behave as expected, except it's action
too is very sluggish - to remove two words from the current search took
several seconds.

Observation: it may be that C-w done in the vicinity of two or several
spaces experiences extra delay.

-- 
Alan Mackenzie (Nuremberg, Germany).

^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]".
  2015-12-04  4:20 bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]" Alan Mackenzie
@ 2015-12-04  9:23 ` Eli Zaretskii
  2015-12-04 15:16   ` Artur Malabarba
       [not found] ` <mailman.1363.1449242229.31583.bug-gnu-emacs@gnu.org>
  1 sibling, 1 reply; 26+ messages in thread
From: Eli Zaretskii @ 2015-12-04  9:23 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: 22090

> Date: Fri, 4 Dec 2015 04:20:52 +0000
> From: Alan Mackenzie <acm@muc.de>
> 
> With a recent emacs-25 (last update
> eaa1fd6dbff8346eb38485de5ebf0fbfacf374d9 from Thursday 2015-12-03):
> 
> emacs -Q
> C-c C-f src/xdisp.c
> Move point to L30 (paragraph beginning "Updating the display is triggered
>   by the Lisp interpreter ...")
> 
> C-s
> C-w repeatedly, to yank words onto the search string.
> 
> After ~29 words have been yanked, the response becomes sluggish, pausing
> for between 0.5s and 1s before highlighting the "for" at the end of L31.
> 
> Carrying on with C-w, some words are taking 2 or 3 seconds to be
> registered by Isearch.  This is Bad.

Here's a profile for this part:

 - command-execute                                                1762  99%
  - call-interactively                                            1762  99%
   - funcall-interactively                                        1762  99%
    - isearch-yank-word-or-char                                   1760  99%
     - isearch-yank-internal                                      1760  99%
      - isearch-yank-string                                       1760  99%
       - isearch-process-search-string                            1760  99%
	- isearch-search-and-update                               1760  99%
	 - isearch-update                                         1760  99%
	  - if                                                    1760  99%
	   - progn                                                1760  99%
	    - while                                               1757  99%
	     - let                                                1757  99%
	      - isearch-lazy-highlight-search                     1757  99%
	       - condition-case                                   1757  99%
		- let                                             1757  99%
		 - while                                          1757  99%
		  - setq                                          1757  99%
		   - isearch-search-string                        1757  99%
		    - let*                                        1757  99%
		     - save-excursion                             1757  99%
		      - funcall                                   1757  99%
		       - #<lambda 0x2f020ea27a40>                 1757  99%
			- let                                     1757  99%
			 - condition-case                         1757  99%
			  - funcall                               1757  99%
			   - cond                                    7   0%
			    - let                                    7   0%
			     - if                                    7   0%
			      - funcall                              7   0%
				 character-fold-to-regexp                  7   0%
	    - isearch-lazy-highlight-new-loop                        2   0%
	     - if                                                    2   0%
	      - and                                                  2   0%
	       - sit-for                                             2   0%
		  redisplay                                          2   0%
	    - if                                                     1   0%
	     - if                                                    1   0%
	      - isearch-message                                      1   0%
	       - let                                                 1   0%
		- if                                                 1   0%
		   let                                               1   0%
    - isearch-forward                                                1   0%
     - isearch-mode                                                  1   0%
      - isearch-update                                               1   0%
       - if                                                          1   0%
	- progn                                                      1   0%
	 - isearch-lazy-highlight-new-loop                           1   0%
	  - if                                                       1   0%
	   - and                                                     1   0%
	    - sit-for                                                1   0%
	     - redisplay                                             1   0%
	      - redisplay_internal (C function)                      1   0%
	       - find-image                                          1   0%
		  image-search-load-path                             1   0%
    - execute-extended-command                                       1   0%
     - command-execute                                               1   0%
      - call-interactively                                           1   0%
       - funcall-interactively                                       1   0%
	- profiler-report                                            1   0%
	 - profiler-report-cpu                                       1   0%
	    profiler-cpu-profile                                     1   0%
 - ...                                                               5   0%
    Automatic GC                                                     5   0%





^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]".
  2015-12-04  9:23 ` Eli Zaretskii
@ 2015-12-04 15:16   ` Artur Malabarba
  2015-12-04 15:23     ` Eli Zaretskii
  2015-12-04 15:49     ` Random832
  0 siblings, 2 replies; 26+ messages in thread
From: Artur Malabarba @ 2015-12-04 15:16 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 22090, Alan Mackenzie

2015-12-04 9:23 GMT+00:00 Eli Zaretskii <eliz@gnu.org>:
>> Date: Fri, 4 Dec 2015 04:20:52 +0000
>> From: Alan Mackenzie <acm@muc.de>
>>
>> With a recent emacs-25 (last update
>> eaa1fd6dbff8346eb38485de5ebf0fbfacf374d9 from Thursday 2015-12-03):
>>
>> emacs -Q
>> C-c C-f src/xdisp.c
>> Move point to L30 (paragraph beginning "Updating the display is triggered
>>   by the Lisp interpreter ...")
>>
>> C-s
>> C-w repeatedly, to yank words onto the search string.
>>
>> After ~29 words have been yanked, the response becomes sluggish, pausing
>> for between 0.5s and 1s before highlighting the "for" at the end of L31.

Thanks for the report. The source for this (and for a similar bug
mentioned on a thread in emacs-devel) was the code I had added for
special case-folding support.
For now, I've just removed the code. I can think of a way of solving
this, but it adds some complexity to isearch, which I don't wanna do
(and I don't think this feature was that important anyway). Here's a
full copy of the commit message explaining why the bug happens.

30f3432 * lisp/character-fold.el: Remove special case-folding support

(character-fold-to-regexp): Remove special code for
case-folding.  Char-fold search still respects the
`case-fold-search' variable (i.e., f matches F).  This only
removes the code that was added to ensure that f also matched
all chars that F matched.  For instance, after this commit, f
no longer matches 𝔽.

This was necessary because the logic created a regexp with
2^(length of the string) redundant paths.  So, when a very
long string "almost" matched, Emacs took a very long time to
figure out that it didn't.  This became particularly relevant
because isearch's lazy-highlight does a search bounded by (1-
match-end) (which, in most circumstances, is a search that
almost matches).  A recipe for this can be found in bug#22090.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]".
  2015-12-04 15:16   ` Artur Malabarba
@ 2015-12-04 15:23     ` Eli Zaretskii
  2015-12-04 16:06       ` Artur Malabarba
  2015-12-04 15:49     ` Random832
  1 sibling, 1 reply; 26+ messages in thread
From: Eli Zaretskii @ 2015-12-04 15:23 UTC (permalink / raw)
  To: bruce.connor.am; +Cc: 22090, acm

> Date: Fri, 4 Dec 2015 15:16:23 +0000
> From: Artur Malabarba <bruce.connor.am@gmail.com>
> Cc: Alan Mackenzie <acm@muc.de>, 22090@debbugs.gnu.org
> 
> 30f3432 * lisp/character-fold.el: Remove special case-folding support
> 
> (character-fold-to-regexp): Remove special code for
> case-folding.  Char-fold search still respects the
> `case-fold-search' variable (i.e., f matches F).  This only
> removes the code that was added to ensure that f also matched
> all chars that F matched.  For instance, after this commit, f
> no longer matches 𝔽.

Thanks.  Is there any reasonably simple way of describing the
resulting limitations (i.e. what will NOT match) on the user manual
level?





^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]".
  2015-12-04 15:16   ` Artur Malabarba
  2015-12-04 15:23     ` Eli Zaretskii
@ 2015-12-04 15:49     ` Random832
  2015-12-04 16:21       ` Artur Malabarba
  1 sibling, 1 reply; 26+ messages in thread
From: Random832 @ 2015-12-04 15:49 UTC (permalink / raw)
  To: 22090

On 2015-12-04, Artur Malabarba <bruce.connor.am@gmail.com> wrote:
> This was necessary because the logic created a regexp with
> 2^(length of the string) redundant paths.  So, when a very
> long string "almost" matched, Emacs took a very long time to
> figure out that it didn't.  This became particularly relevant
> because isearch's lazy-highlight does a search bounded by (1-
> match-end) (which, in most circumstances, is a search that
> almost matches).  A recipe for this can be found in bug#22090.

So has any thought been given to implementing folding searches
via matching a simple regexp against a projected version of the
buffer rather than the current mechanism of creating a regexp
that will always match when it should?






^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]".
  2015-12-04 15:23     ` Eli Zaretskii
@ 2015-12-04 16:06       ` Artur Malabarba
  2015-12-04 16:27         ` Eli Zaretskii
  0 siblings, 1 reply; 26+ messages in thread
From: Artur Malabarba @ 2015-12-04 16:06 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 22090, Alan Mackenzie

2015-12-04 15:23 GMT+00:00 Eli Zaretskii <eliz@gnu.org>:
>> Date: Fri, 4 Dec 2015 15:16:23 +0000
>> From: Artur Malabarba <bruce.connor.am@gmail.com>
>> Cc: Alan Mackenzie <acm@muc.de>, 22090@debbugs.gnu.org
>>
>> 30f3432 * lisp/character-fold.el: Remove special case-folding support
>>
>> (character-fold-to-regexp): Remove special code for
>> case-folding.  Char-fold search still respects the
>> `case-fold-search' variable (i.e., f matches F).  This only
>> removes the code that was added to ensure that f also matched
>> all chars that F matched.  For instance, after this commit, f
>> no longer matches 𝔽.
>
> Thanks.  Is there any reasonably simple way of describing the
> resulting limitations (i.e. what will NOT match) on the user manual
> level?

Basically, 'a' will match similar characters (like '𝑎' and 'á') and
their upper-case equivalents (like 'Á'). 'a' will NOT match characters
similar to 'A' that don't have a lower-case equivalent (like '𝔸') in
the unicode standard.





^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]".
  2015-12-04 15:49     ` Random832
@ 2015-12-04 16:21       ` Artur Malabarba
  2015-12-04 16:37         ` Random832
  0 siblings, 1 reply; 26+ messages in thread
From: Artur Malabarba @ 2015-12-04 16:21 UTC (permalink / raw)
  To: Random832; +Cc: 22090

[-- Attachment #1: Type: text/plain, Size: 483 bytes --]

On 4 Dec 2015 3:49 pm, "Random832" <random832@fastmail.com> wrote:
> So has any thought been given to implementing folding searches
> via matching a simple regexp against a projected version of the
> buffer rather than the current mechanism of creating a regexp
> that will always match when it should?

There were suggestions of projecting both the buffer and the search string
(which is what case folding does) but nobody has offered to do it.
What do you mean by "simple regexp"?

[-- Attachment #2: Type: text/html, Size: 620 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]".
  2015-12-04 16:06       ` Artur Malabarba
@ 2015-12-04 16:27         ` Eli Zaretskii
  2015-12-04 16:37           ` Artur Malabarba
  0 siblings, 1 reply; 26+ messages in thread
From: Eli Zaretskii @ 2015-12-04 16:27 UTC (permalink / raw)
  To: bruce.connor.am; +Cc: 22090, acm

> Date: Fri, 4 Dec 2015 16:06:40 +0000
> From: Artur Malabarba <bruce.connor.am@gmail.com>
> Cc: Alan Mackenzie <acm@muc.de>, 22090@debbugs.gnu.org
> 
> > Thanks.  Is there any reasonably simple way of describing the
> > resulting limitations (i.e. what will NOT match) on the user manual
> > level?
> 
> Basically, 'a' will match similar characters (like '𝑎' and 'á') and
> their upper-case equivalents (like 'Á'). 'a' will NOT match characters
> similar to 'A' that don't have a lower-case equivalent (like '𝔸') in
> the unicode standard.

What about ligatures, or symbols like ℻?

Also, by "lower-case equivalent" do you mean a case mapping defined by
the UCD, or just by visual appearance?  An example of the latter would
be Ⓐ vs ⓐ.





^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]".
  2015-12-04 16:27         ` Eli Zaretskii
@ 2015-12-04 16:37           ` Artur Malabarba
  2015-12-04 18:48             ` Eli Zaretskii
  0 siblings, 1 reply; 26+ messages in thread
From: Artur Malabarba @ 2015-12-04 16:37 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 22090, Alan Mackenzie

[-- Attachment #1: Type: text/plain, Size: 1087 bytes --]

On 4 Dec 2015 4:27 pm, "Eli Zaretskii" <eliz@gnu.org> wrote:
>
> > Date: Fri, 4 Dec 2015 16:06:40 +0000
> > From: Artur Malabarba <bruce.connor.am@gmail.com>
> > Cc: Alan Mackenzie <acm@muc.de>, 22090@debbugs.gnu.org
> >
> > > Thanks.  Is there any reasonably simple way of describing the
> > > resulting limitations (i.e. what will NOT match) on the user manual
> > > level?
> >
> > Basically, 'a' will match similar characters (like '𝑎' and 'á') and
> > their upper-case equivalents (like 'Á'). 'a' will NOT match characters
> > similar to 'A' that don't have a lower-case equivalent (like '𝔸') in
> > the unicode standard.
>
> What about ligatures, or symbols like ℻?

Won't match cross-case.

> Also, by "lower-case equivalent" do you mean a case mapping defined by
> the UCD

Yes. Visual appearance is irrelevant.

Strictly speaking, to match a general character, you need to search for its
decomposition. If this character also has a case "equivalent" (as per
current-case-table), you can also search for the decomposition of this
equivalent.

[-- Attachment #2: Type: text/html, Size: 1599 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]".
  2015-12-04 16:21       ` Artur Malabarba
@ 2015-12-04 16:37         ` Random832
  2015-12-04 16:51           ` Artur Malabarba
  2015-12-04 18:24           ` Eli Zaretskii
  0 siblings, 2 replies; 26+ messages in thread
From: Random832 @ 2015-12-04 16:37 UTC (permalink / raw)
  To: 22090

On 2015-12-04, Artur Malabarba <bruce.connor.am@gmail.com> wrote:
> There were suggestions of projecting both the buffer and the search string
> (which is what case folding does) but nobody has offered to do it.
> What do you mean by "simple regexp"?

As in none of the huge character classes for folding, just the
characters the user types (normalized to all-lowercase or
all-uppercase for case-folding searches) but maybe e.g. "\s+"
for lax whitespace.






^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]".
  2015-12-04 16:37         ` Random832
@ 2015-12-04 16:51           ` Artur Malabarba
  2015-12-04 18:24           ` Eli Zaretskii
  1 sibling, 0 replies; 26+ messages in thread
From: Artur Malabarba @ 2015-12-04 16:51 UTC (permalink / raw)
  To: Random832; +Cc: 22090

[-- Attachment #1: Type: text/plain, Size: 496 bytes --]

On 4 Dec 2015 4:37 pm, "Random832" <random832@fastmail.com> wrote:
>
> On 2015-12-04, Artur Malabarba <bruce.connor.am@gmail.com> wrote:
> > There were suggestions of projecting both the buffer and the search
string
> > (which is what case folding does) but nobody has offered to do it.
> > What do you mean by "simple regexp"?
>
> As in...

I see. Then the answer is the same. Nobody has offered to write the C code
necessary to compare only a "projection" of the buffer with the search
string.

[-- Attachment #2: Type: text/html, Size: 722 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]".
       [not found] ` <mailman.1363.1449242229.31583.bug-gnu-emacs@gnu.org>
@ 2015-12-04 17:01   ` Alan Mackenzie
  2015-12-04 19:21   ` Alan Mackenzie
  1 sibling, 0 replies; 26+ messages in thread
From: Alan Mackenzie @ 2015-12-04 17:01 UTC (permalink / raw)
  To: 22090-done; +Cc: bruce.connor.am

Hello, Artur.

In article <mailman.1363.1449242229.31583.bug-gnu-emacs@gnu.org> you wrote:
> 2015-12-04 9:23 GMT+00:00 Eli Zaretskii <eliz@gnu.org>:
>>> Date: Fri, 4 Dec 2015 04:20:52 +0000
>>> From: Alan Mackenzie <acm@muc.de>
>>>
>>> With a recent emacs-25 (last update
>>> eaa1fd6dbff8346eb38485de5ebf0fbfacf374d9 from Thursday 2015-12-03):
>>>
>>> emacs -Q
>>> C-c C-f src/xdisp.c
>>> Move point to L30 (paragraph beginning "Updating the display is triggered
>>>   by the Lisp interpreter ...")
>>>
>>> C-s
>>> C-w repeatedly, to yank words onto the search string.
>>>
>>> After ~29 words have been yanked, the response becomes sluggish, pausing
>>> for between 0.5s and 1s before highlighting the "for" at the end of L31.

> Thanks for the report. The source for this (and for a similar bug
> mentioned on a thread in emacs-devel) was the code I had added for
> special case-folding support.
> For now, I've just removed the code. I can think of a way of solving
> this, but it adds some complexity to isearch, which I don't wanna do
> (and I don't think this feature was that important anyway). Here's a
> full copy of the commit message explaining why the bug happens.

Thanks for reacting to this so quickly.  I confirm that both symptoms of
the bug have been resolved.

So I'm closing this bug.

[ .... ]

-- 
Alan Mackenzie (Nuremberg, Germany).






^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]".
  2015-12-04 16:37         ` Random832
  2015-12-04 16:51           ` Artur Malabarba
@ 2015-12-04 18:24           ` Eli Zaretskii
  1 sibling, 0 replies; 26+ messages in thread
From: Eli Zaretskii @ 2015-12-04 18:24 UTC (permalink / raw)
  To: Random832; +Cc: 22090

> From: Random832 <random832@fastmail.com>
> Date: Fri, 4 Dec 2015 16:37:59 +0000 (UTC)
> 
> On 2015-12-04, Artur Malabarba <bruce.connor.am@gmail.com> wrote:
> > There were suggestions of projecting both the buffer and the search string
> > (which is what case folding does) but nobody has offered to do it.
> > What do you mean by "simple regexp"?
> 
> As in none of the huge character classes for folding, just the
> characters the user types (normalized to all-lowercase or
> all-uppercase for case-folding searches) but maybe e.g. "\s+"
> for lax whitespace.

You need to normalize the buffer text as well, so this must be done on
the C level.





^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]".
  2015-12-04 16:37           ` Artur Malabarba
@ 2015-12-04 18:48             ` Eli Zaretskii
  2015-12-04 19:59               ` Artur Malabarba
  0 siblings, 1 reply; 26+ messages in thread
From: Eli Zaretskii @ 2015-12-04 18:48 UTC (permalink / raw)
  To: bruce.connor.am; +Cc: 22090, acm

> Date: Fri, 4 Dec 2015 16:37:08 +0000
> From: Artur Malabarba <bruce.connor.am@gmail.com>
> Cc: 22090@debbugs.gnu.org, Alan Mackenzie <acm@muc.de>
> 
> > What about ligatures, or symbols like ℻?
> 
> Won't match cross-case. 

So f will match ﬃ but not ℻, and F will match ℻ but not ﬃ, is that
right?

> > Also, by "lower-case equivalent" do you mean a case mapping defined by
> > the UCD
> 
> Yes. Visual appearance is irrelevant. 
> 
> Strictly speaking, to match a general character, you need to search for its
> decomposition. If this character also has a case "equivalent" (as per
> current-case-table), you can also search for the decomposition of this
> equivalent. 

OK, thanks.  I will try to think if this needs to be explained in the
manual.





^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]".
       [not found] ` <mailman.1363.1449242229.31583.bug-gnu-emacs@gnu.org>
  2015-12-04 17:01   ` Alan Mackenzie
@ 2015-12-04 19:21   ` Alan Mackenzie
  2015-12-04 20:08     ` Eli Zaretskii
  2015-12-04 20:49     ` Artur Malabarba
  1 sibling, 2 replies; 26+ messages in thread
From: Alan Mackenzie @ 2015-12-04 19:21 UTC (permalink / raw)
  To: bruce.connor.am; +Cc: 22090

Hello, Artur.

In article <mailman.1363.1449242229.31583.bug-gnu-emacs@gnu.org> you wrote:
> 2015-12-04 9:23 GMT+00:00 Eli Zaretskii <eliz@gnu.org>:
>>> Date: Fri, 4 Dec 2015 04:20:52 +0000
>>> From: Alan Mackenzie <acm@muc.de>
>>>
>>> With a recent emacs-25 (last update
>>> eaa1fd6dbff8346eb38485de5ebf0fbfacf374d9 from Thursday 2015-12-03):
>>>
>>> emacs -Q
>>> C-c C-f src/xdisp.c
>>> Move point to L30 (paragraph beginning "Updating the display is triggered
>>>   by the Lisp interpreter ...")
>>>
>>> C-s
>>> C-w repeatedly, to yank words onto the search string.
>>>
>>> After ~29 words have been yanked, the response becomes sluggish, pausing
>>> for between 0.5s and 1s before highlighting the "for" at the end of L31.

> Thanks for the report. The source for this (and for a similar bug
> mentioned on a thread in emacs-devel) was the code I had added for
> special case-folding support.
> For now, I've just removed the code. I can think of a way of solving
> this, but it adds some complexity to isearch, which I don't wanna do
> (and I don't think this feature was that important anyway). Here's a
> full copy of the commit message explaining why the bug happens.

> 30f3432 * lisp/character-fold.el: Remove special case-folding support

> (character-fold-to-regexp): Remove special code for
> case-folding.  Char-fold search still respects the
> `case-fold-search' variable (i.e., f matches F).  This only
> removes the code that was added to ensure that f also matched
> all chars that F matched.  For instance, after this commit, f
> no longer matches 𝔽.

> This was necessary because the logic created a regexp with
> 2^(length of the string) redundant paths.  So, when a very
> long string "almost" matched, Emacs took a very long time to
> figure out that it didn't.  This became particularly relevant
> because isearch's lazy-highlight does a search bounded by (1-
> match-end) (which, in most circumstances, is a search that
> almost matches).  A recipe for this can be found in bug#22090.

Would you like any help to sort out these regexps?  I have some expertise
in doing this, having half-written fix-re.el, a program which analyses
and corrects just the sort of thing you're talking about.

-- 
Alan Mackenzie (Nuremberg, Germany).






^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]".
  2015-12-04 18:48             ` Eli Zaretskii
@ 2015-12-04 19:59               ` Artur Malabarba
  2015-12-05  9:19                 ` Eli Zaretskii
  0 siblings, 1 reply; 26+ messages in thread
From: Artur Malabarba @ 2015-12-04 19:59 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 22090, Alan Mackenzie

2015-12-04 18:48 GMT+00:00 Eli Zaretskii <eliz@gnu.org>:
>> > What about ligatures, or symbols like ℻?
>>
>> Won't match cross-case.
>
> So f will match ﬃ but not ℻, and F will match ℻ but not ﬃ, is that
> right?

No. ffi will match ﬃ, but FFI won't. And FAX will match ℻, but fax won't.
f shouldn't match ﬃ anymore.





^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]".
  2015-12-04 19:21   ` Alan Mackenzie
@ 2015-12-04 20:08     ` Eli Zaretskii
  2015-12-04 20:49     ` Artur Malabarba
  1 sibling, 0 replies; 26+ messages in thread
From: Eli Zaretskii @ 2015-12-04 20:08 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: 22090, bruce.connor.am

> Date: 4 Dec 2015 19:21:26 -0000
> From: Alan Mackenzie <acm@muc.de>
> Cc: 22090@debbugs.gnu.org
> 
> > (character-fold-to-regexp): Remove special code for
> > case-folding.  Char-fold search still respects the
> > `case-fold-search' variable (i.e., f matches F).  This only
> > removes the code that was added to ensure that f also matched
> > all chars that F matched.  For instance, after this commit, f
> > no longer matches 𝔽.
> 
> > This was necessary because the logic created a regexp with
> > 2^(length of the string) redundant paths.  So, when a very
> > long string "almost" matched, Emacs took a very long time to
> > figure out that it didn't.  This became particularly relevant
> > because isearch's lazy-highlight does a search bounded by (1-
> > match-end) (which, in most circumstances, is a search that
> > almost matches).  A recipe for this can be found in bug#22090.
> 
> Would you like any help to sort out these regexps?

I'm not sure the use cases related to case folding should be working
in principle.  That's because normalization under case folding means
first downcase, then decompose; it is not allowed to downcase the
decomposition.  So if this issue is only about those, I don't think
there's anything to sort out here, thanks.





^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]".
  2015-12-04 19:21   ` Alan Mackenzie
  2015-12-04 20:08     ` Eli Zaretskii
@ 2015-12-04 20:49     ` Artur Malabarba
  2015-12-04 23:00       ` Alan Mackenzie
  1 sibling, 1 reply; 26+ messages in thread
From: Artur Malabarba @ 2015-12-04 20:49 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: 22090

2015-12-04 19:21 GMT+00:00 Alan Mackenzie <acm@muc.de>:
> Would you like any help to sort out these regexps?  I have some expertise
> in doing this, having half-written fix-re.el, a program which analyses
> and corrects just the sort of thing you're talking about.

Maybe you can help then. The situation is actually quite simple.
We have a regexp for matching anything that 'a' should match (for
instance, that might look like "\\(a[´`]?\\|[áà𝑎]\\)"), and we have
another for matching anything that A could match (e.g.
"\\(A[`´]?\\|[ÁÀ]\\)").
When case-fold-search is on the previous code would simply join these
regexps with "\\(\\(a[´`]?\\|[áà𝑎]\\)\\|\\(A[`´]?\\|[ÁÀ]\\)\\)". The
problem is that (when case-fold-search is on) this creates a lot of
redundancy. There are two paths in that regexp that match "a", there
are two paths that match "à" and so on (but it's not full redundancy,
for instance, only one path matches 𝑎).

^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]".
  2015-12-04 20:49     ` Artur Malabarba
@ 2015-12-04 23:00       ` Alan Mackenzie
  2015-12-05 17:23         ` Artur Malabarba
  0 siblings, 1 reply; 26+ messages in thread
From: Alan Mackenzie @ 2015-12-04 23:00 UTC (permalink / raw)
  To: Artur Malabarba; +Cc: 22090

Hello, Artur.

On Fri, Dec 04, 2015 at 08:49:42PM +0000, Artur Malabarba wrote:
> 2015-12-04 19:21 GMT+00:00 Alan Mackenzie <acm@muc.de>:
> > Would you like any help to sort out these regexps?  I have some expertise
> > in doing this, having half-written fix-re.el, a program which analyses
> > and corrects just the sort of thing you're talking about.

> Maybe you can help then. The situation is actually quite simple.
> We have a regexp for matching anything that 'a' should match (for
> instance, that might look like "\\(a[´`]?\\|[áà𝑎]\\)"), and we have
> another for matching anything that A could match (e.g.
> "\\(A[`´]?\\|[ÁÀ]\\)").

Each of these regexps looks intrinsically blameless..

> When case-fold-search is on the previous code would simply join these
> regexps with "\\(\\(a[´`]?\\|[áà𝑎]\\)\\|\\(A[`´]?\\|[ÁÀ]\\)\\)".

Quick question: _why_ do you need to join them?  Given that
case-fold-search is enabled, couldn't you just use, say, the lower case
version?

> The problem is that (when case-fold-search is on) this creates a lot
> of redundancy. There are two paths in that regexp that match "a",
> there are two paths that match "à" and so on (but it's not full
> redundancy, for instance, only one path matches 𝑎).

Yes.  This is the killer danger in regexps (at least with the sort of
regexp engine we've got).  But it looks to me that this redundancy would
be quite easy to eliminate - you just need three regexp fragments for
the letter "a" - a lower case one, an upper case one and a
case-fold-search one.

The other thing is that for that single character "a" a 39 character
regexp fragment is being generated.  Might this have something to do
with the "[Too many words]" error I got last night (which comes from the
regexp engine returning a "too long regexp" error)?

Even if you can reduce that to, say 19 characters, that's only winning a
factor of 2 in the slide towards a too long regexp.  It might well be
that for a very long regexp, you might have to divide it into shorter
sections (a typical long RE will by a sequence of sub expressions,
rather than lots of alternatives inside \(...\|........\)).

-- 
Alan Mackenzie (Nuremberg, Germany).

^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]".
  2015-12-04 19:59               ` Artur Malabarba
@ 2015-12-05  9:19                 ` Eli Zaretskii
  0 siblings, 0 replies; 26+ messages in thread
From: Eli Zaretskii @ 2015-12-05  9:19 UTC (permalink / raw)
  To: bruce.connor.am; +Cc: 22090, acm

> Date: Fri, 4 Dec 2015 19:59:59 +0000
> From: Artur Malabarba <bruce.connor.am@gmail.com>
> Cc: 22090@debbugs.gnu.org, Alan Mackenzie <acm@muc.de>
> 
> 2015-12-04 18:48 GMT+00:00 Eli Zaretskii <eliz@gnu.org>:
> >> > What about ligatures, or symbols like ℻?
> >>
> >> Won't match cross-case.
> >
> > So f will match ﬃ but not ℻, and F will match ℻ but not ﬃ, is that
> > right?
> 
> No. ffi will match ﬃ, but FFI won't. And FAX will match ℻, but fax won't.
> f shouldn't match ﬃ anymore.

OK, thanks.  It seems like the text in the manual already complies
with the above.





^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]".
  2015-12-04 23:00       ` Alan Mackenzie
@ 2015-12-05 17:23         ` Artur Malabarba
  2015-12-05 17:32           ` Eli Zaretskii
  2015-12-05 18:52           ` Alan Mackenzie
  0 siblings, 2 replies; 26+ messages in thread
From: Artur Malabarba @ 2015-12-05 17:23 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: 22090

nn2015-12-04 23:00 GMT+00:00 Alan Mackenzie <acm@muc.de>:
>> When case-fold-search is on the previous code would simply join these
>> regexps with "\\(\\(a[´`]?\\|[áà𝑎]\\)\\|\\(A[`´]?\\|[ÁÀ]\\)\\)".
>
> Quick question: _why_ do you need to join them?  Given that
> case-fold-search is enabled, couldn't you just use, say, the lower case
> version?

Because there are some characters in each regexp that don't have
lower/upper-case equivalents. For instance, if I use the
"\\(\\(a[´`]?\\|[áà𝑎]\\)" regexp, that's enough to match A or À, but
it's not enough to match a variety of other chars (𝔸𝕬𝖠𝗔𝘈𝘼𝙰🄰).

> it looks to me that this redundancy would
> be quite easy to eliminate - you just need three regexp fragments for
> the letter "a" - a lower case one, an upper case one and a
> case-fold-search one.

Yes, we could go that route. It's just going to add complexity to the
code that generates the char-fold-table (which is already quite dense)
and I wonder if it's worth such a corner-case. Like I said, 'a'
already matches A and À, how much do we want to support this extra
case-folding?

> The other thing is that for that single character "a" a 39 character
> regexp fragment is being generated.  Might this have something to do
> with the "[Too many words]" error I got last night (which comes from the
> regexp engine returning a "too long regexp" error)?

yes

> Even if you can reduce that to, say 19 characters, that's only winning a
> factor of 2 in the slide towards a too long regexp.  It might well be
> that for a very long regexp, you might have to divide it into shorter
> sections (a typical long RE will by a sequence of sub expressions,
> rather than lots of alternatives inside \(...\|........\)).

I don't understand what you mean. Could you elaborate?





^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]".
  2015-12-05 17:23         ` Artur Malabarba
@ 2015-12-05 17:32           ` Eli Zaretskii
  2015-12-05 18:12             ` Artur Malabarba
  2015-12-05 18:52           ` Alan Mackenzie
  1 sibling, 1 reply; 26+ messages in thread
From: Eli Zaretskii @ 2015-12-05 17:32 UTC (permalink / raw)
  To: bruce.connor.am; +Cc: 22090, acm

> Date: Sat, 5 Dec 2015 17:23:53 +0000
> From: Artur Malabarba <bruce.connor.am@gmail.com>
> Cc: 22090@debbugs.gnu.org
> 
> nn2015-12-04 23:00 GMT+00:00 Alan Mackenzie <acm@muc.de>:
> >> When case-fold-search is on the previous code would simply join these
> >> regexps with "\\(\\(a[´`]?\\|[áà𝑎]\\)\\|\\(A[`´]?\\|[ÁÀ]\\)\\)".
> >
> > Quick question: _why_ do you need to join them?  Given that
> > case-fold-search is enabled, couldn't you just use, say, the lower case
> > version?
> 
> Because there are some characters in each regexp that don't have
> lower/upper-case equivalents. For instance, if I use the
> "\\(\\(a[´`]?\\|[áà𝑎]\\)" regexp, that's enough to match A or À, but
> it's not enough to match a variety of other chars (𝔸𝕬𝖠𝗔𝘈𝘼𝙰🄰).

You don't need to match the latter set.  Character folding is applied
_after_ case folding, not before.  So characters that don't have a
lower-case variant simply shouldn't match a lower-case a -- and they
won't, if you just let case-insensitive regexp matching do its job.





^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]".
  2015-12-05 17:32           ` Eli Zaretskii
@ 2015-12-05 18:12             ` Artur Malabarba
  2015-12-05 18:34               ` Eli Zaretskii
  0 siblings, 1 reply; 26+ messages in thread
From: Artur Malabarba @ 2015-12-05 18:12 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 22090, Alan Mackenzie

2015-12-05 17:32 GMT+00:00 Eli Zaretskii <eliz@gnu.org>:
>> Because there are some characters in each regexp that don't have
>> lower/upper-case equivalents. For instance, if I use the
>> "\\(\\(a[´`]?\\|[áà𝑎]\\)" regexp, that's enough to match A or À, but
>> it's not enough to match a variety of other chars (𝔸𝕬𝖠𝗔𝘈𝘼𝙰🄰).
>
> You don't need to match the latter set.  Character folding is applied
> _after_ case folding, not before.  So characters that don't have a
> lower-case variant simply shouldn't match a lower-case a -- and they
> won't, if you just let case-insensitive regexp matching do its job.

Given that char-folding is a new feature, how it combines with
case-folding is entirely up to us, and I have really no idea what
would be TRT.
However, if that is your opinion, I'm more than happy to accept that
the current situation ('a' doesn't match '𝔸𝕬𝖠𝗔𝘈𝘼𝙰🄰') is TRT,
given that it has the simplest implementation. :-)





^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]".
  2015-12-05 18:12             ` Artur Malabarba
@ 2015-12-05 18:34               ` Eli Zaretskii
  0 siblings, 0 replies; 26+ messages in thread
From: Eli Zaretskii @ 2015-12-05 18:34 UTC (permalink / raw)
  To: bruce.connor.am; +Cc: 22090, acm

> Date: Sat, 5 Dec 2015 18:12:52 +0000
> From: Artur Malabarba <bruce.connor.am@gmail.com>
> Cc: Alan Mackenzie <acm@muc.de>, 22090@debbugs.gnu.org
> 
> 2015-12-05 17:32 GMT+00:00 Eli Zaretskii <eliz@gnu.org>:
> >> Because there are some characters in each regexp that don't have
> >> lower/upper-case equivalents. For instance, if I use the
> >> "\\(\\(a[´`]?\\|[áà𝑎]\\)" regexp, that's enough to match A or À, but
> >> it's not enough to match a variety of other chars (𝔸𝕬𝖠𝗔𝘈𝘼𝙰🄰).
> >
> > You don't need to match the latter set.  Character folding is applied
> > _after_ case folding, not before.  So characters that don't have a
> > lower-case variant simply shouldn't match a lower-case a -- and they
> > won't, if you just let case-insensitive regexp matching do its job.
> 
> Given that char-folding is a new feature, how it combines with
> case-folding is entirely up to us, and I have really no idea what
> would be TRT.

I don't think there's any reasonable alternative, because for
characters that have a decomposition, you wouldn't downcase the result
of the decomposition, would you?

The Unicode Standard also says this much (p.158):

  In principle, normalization needs to be done after case folding,
  because case folding does not preserve the normalized form of
  strings in all instances.

(There are a couple of examples there showing why the reverse order
could cause incorrect results.)

So if this is true for normalization, it should also be true for the
case in point.

> However, if that is your opinion, I'm more than happy to accept that
> the current situation ('a' doesn't match '𝔸𝕬𝖠𝗔𝘈𝘼𝙰🄰') is TRT,
> given that it has the simplest implementation. :-)

Yes, I think it's TRT.





^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]".
  2015-12-05 17:23         ` Artur Malabarba
  2015-12-05 17:32           ` Eli Zaretskii
@ 2015-12-05 18:52           ` Alan Mackenzie
  2015-12-06 12:50             ` Artur Malabarba
  1 sibling, 1 reply; 26+ messages in thread
From: Alan Mackenzie @ 2015-12-05 18:52 UTC (permalink / raw)
  To: Artur Malabarba; +Cc: 22090

Hello, Artur.

On Sat, Dec 05, 2015 at 05:23:53PM +0000, Artur Malabarba wrote:
> nn2015-12-04 23:00 GMT+00:00 Alan Mackenzie <acm@muc.de>:
> >> When case-fold-search is on the previous code would simply join these
> >> regexps with "\\(\\(a[´`]?\\|[áà𝑎]\\)\\|\\(A[`´]?\\|[ÁÀ]\\)\\)".

> > Quick question: _why_ do you need to join them?  Given that
> > case-fold-search is enabled, couldn't you just use, say, the lower case
> > version?

> Because there are some characters in each regexp that don't have
> lower/upper-case equivalents. For instance, if I use the
> "\\(\\(a[´`]?\\|[áà𝑎]\\)" regexp, that's enough to match A or À, but
> it's not enough to match a variety of other chars (𝔸𝕬𝖠𝗔𝘈𝘼𝙰🄰).

OK, thanks.

> > it looks to me that this redundancy would
> > be quite easy to eliminate - you just need three regexp fragments for
> > the letter "a" - a lower case one, an upper case one and a
> > case-fold-search one.

> Yes, we could go that route. It's just going to add complexity to the
> code that generates the char-fold-table (which is already quite dense)
> and I wonder if it's worth such a corner-case. Like I said, 'a'
> already matches A and À, how much do we want to support this extra
> case-folding?

But it seems the complexity (and it can't honestly be that much,
surely?) is intrinsic to the task being carried out.  Sticking a "\\|"
between the upper case and lower case versions clearly doesn't work.

Seriously, how difficult can it be to generate

    "\\([Aa][´`]?\\|[áà𝑎ÁÀ]\\)"

, which is a blameless regexp, given where you've already got to?

> > The other thing is that for that single character "a" a 39 character
> > regexp fragment is being generated.  Might this have something to do
> > with the "[Too many words]" error I got last night (which comes from the
> > regexp engine returning a "too long regexp" error)?

> yes

I was afraid of that.

> > Even if you can reduce that to, say 19 characters, that's only winning a
> > factor of 2 in the slide towards a too long regexp.  It might well be
> > that for a very long regexp, you might have to divide it into shorter
> > sections (a typical long RE will by a sequence of sub expressions,
> > rather than lots of alternatives inside \(...\|........\)).

> I don't understand what you mean. Could you elaborate?

Once you've generated the long regexp, if it's too long, you can split
it up into, say, 3 pieces A, B, C, such that (equal re (concat A B C)).

Then you can do something like:

    (and (search-forward-regexp A bound noerror)
    	 (search-forward-regexp (concat "\\=" B) bound noerror)
	 (search-forward-regexp (concat "\\=" C) bound noerror))

.  Though, thinking about it, it might be less painful to enhance the
regexp engine to take longer regexps.

-- 
Alan Mackenzie (Nuremberg, Germany).





^ permalink raw reply	[flat|nested] 26+ messages in thread

* bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]".
  2015-12-05 18:52           ` Alan Mackenzie
@ 2015-12-06 12:50             ` Artur Malabarba
  0 siblings, 0 replies; 26+ messages in thread
From: Artur Malabarba @ 2015-12-06 12:50 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: 22090

2015-12-05 18:52 GMT+00:00 Alan Mackenzie <acm@muc.de>:
> But it seems the complexity (and it can't honestly be that much,
> surely?) is intrinsic to the task being carried out.  Sticking a "\\|"
> between the upper case and lower case versions clearly doesn't work.
>
> Seriously, how difficult can it be to generate
>
>     "\\([Aa][´`]?\\|[áà𝑎ÁÀ]\\)"
>
> , which is a blameless regexp, given where you've already got to?

Oh. I see. I thought you were talking about mutually exclusive
regexps. Indeed a regexp like that would be trivial to generate. But
is it really blameless? I mean, if "\\(A\\|a\\)" can lead to extremely
slow searches, doesn't the same happen with "[Aa]"?

Anyway, at this point I'm just asking for future knowledge/reference.
According to Eli, the current implementation is in accordance with the
Unicode Standard. So it's probably best to keep it this way at least
for the first release of the feature.

> Once you've generated the long regexp, if it's too long, you can split
> it up into, say, 3 pieces A, B, C, such that (equal re (concat A B C)).
>
> Then you can do something like:
>
>     (and (search-forward-regexp A bound noerror)
>          (search-forward-regexp (concat "\\=" B) bound noerror)
>          (search-forward-regexp (concat "\\=" C) bound noerror))
>
> .  Though, thinking about it, it might be less painful to enhance the
> regexp engine to take longer regexps.

Besides. Char-folding is supposed to turn strings into regexps usable
anywhere, and this wouldn't work with that.

I've added a clause to the function so that it won't do any
charfolding if the resulting regexp would be longer than 5k chars
(instead it will just regexp-quote). That will at least prevent the
too-many words error in isearch. (I already had this clause in there
before, but it was using 10k, which apparently is not enough).

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2015-12-06 12:50 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-12-04  4:20 bug#22090: Isearch is sluggish and eventually refuses further service with "[Too many words]" Alan Mackenzie
2015-12-04  9:23 ` Eli Zaretskii
2015-12-04 15:16   ` Artur Malabarba
2015-12-04 15:23     ` Eli Zaretskii
2015-12-04 16:06       ` Artur Malabarba
2015-12-04 16:27         ` Eli Zaretskii
2015-12-04 16:37           ` Artur Malabarba
2015-12-04 18:48             ` Eli Zaretskii
2015-12-04 19:59               ` Artur Malabarba
2015-12-05  9:19                 ` Eli Zaretskii
2015-12-04 15:49     ` Random832
2015-12-04 16:21       ` Artur Malabarba
2015-12-04 16:37         ` Random832
2015-12-04 16:51           ` Artur Malabarba
2015-12-04 18:24           ` Eli Zaretskii
     [not found] ` <mailman.1363.1449242229.31583.bug-gnu-emacs@gnu.org>
2015-12-04 17:01   ` Alan Mackenzie
2015-12-04 19:21   ` Alan Mackenzie
2015-12-04 20:08     ` Eli Zaretskii
2015-12-04 20:49     ` Artur Malabarba
2015-12-04 23:00       ` Alan Mackenzie
2015-12-05 17:23         ` Artur Malabarba
2015-12-05 17:32           ` Eli Zaretskii
2015-12-05 18:12             ` Artur Malabarba
2015-12-05 18:34               ` Eli Zaretskii
2015-12-05 18:52           ` Alan Mackenzie
2015-12-06 12:50             ` Artur Malabarba

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).