bug#13084: boyer_moore crashes with certain characters in the case table

unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed

* bug#13084: boyer_moore crashes with certain characters in the case table
@ 2012-12-05  0:34 Juri Linkov
  2012-12-11 15:37 ` Eli Zaretskii
  0 siblings, 1 reply; 14+ messages in thread
From: Juri Linkov @ 2012-12-05  0:34 UTC (permalink / raw)
  To: 13084

The minimal reproducible recipe for crashes in boyer_moore noticed in bug#13041:

1. emacs -Q

2. Eval in *scratch*:

(let ((table (standard-case-table)) canon)
  (setq canon (copy-sequence table))
  (aset canon #xff59 ?y)
  (set-char-table-extra-slot table 1 canon)
  (set-char-table-extra-slot table 2 nil)
  (set-standard-case-table table))

3. Start an activity that includes a search, e.g. `C-x 8 RET TAB'

The crash in boyer_moore is caused by fullwidth characters like #xff59
whose Unicode properties are:

  name: FULLWIDTH LATIN SMALL LETTER Y
  decomposition: (wide 121) (wide 'y')

However, the crash doesn't occur when the same fullwidth characters are
set to their downcase counterparts in lisp/international/characters.el:

  ;; Fullwidth Latin
  (setq c #xff21)
  (while (<= c #xff3a)
    (set-case-syntax-pair c (+ c #x20) tbl)
    (modify-category-entry c ?l)
    (modify-category-entry (+ c #x20) ?l)
    (setq c (1+ c)))





^ permalink raw reply	[flat|nested] 14+ messages in thread

* bug#13084: boyer_moore crashes with certain characters in the case table
  2012-12-05  0:34 bug#13084: boyer_moore crashes with certain characters in the case table Juri Linkov
@ 2012-12-11 15:37 ` Eli Zaretskii
  2012-12-11 23:17   ` Juri Linkov
  2012-12-13 13:39   ` Kenichi Handa
  0 siblings, 2 replies; 14+ messages in thread
From: Eli Zaretskii @ 2012-12-11 15:37 UTC (permalink / raw)
  To: Juri Linkov, Kenichi Handa; +Cc: 13084

> From: Juri Linkov <juri@jurta.org>
> Date: Wed, 05 Dec 2012 02:34:39 +0200
> 
> The minimal reproducible recipe for crashes in boyer_moore noticed in bug#13041:
> 
> 1. emacs -Q
> 
> 2. Eval in *scratch*:
> 
> (let ((table (standard-case-table)) canon)
>   (setq canon (copy-sequence table))
>   (aset canon #xff59 ?y)
>   (set-char-table-extra-slot table 1 canon)
>   (set-char-table-extra-slot table 2 nil)
>   (set-standard-case-table table))
> 
> 3. Start an activity that includes a search, e.g. `C-x 8 RET TAB'

Thanks.  I think i fixed this (revision 111021 on the emacs-24
branch), please test.

In addition, I'd suggest that Handa-san (or someone else) takes a good
look at the code that sets up the simple_translate table in
boyer_moore, because the constants there, like 0200 and 0x3F, and all
the talk about characters that belong "to the same charset and row"
smell of pre-Unicode (a.k.a. "MULE") representation of characters.
For now, I disabled boyer_moore for unibyte characters beyond 160,
because my reading of the code is that simple_translate and the
supporting code cannot handle that.  Maybe I'm wrong.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* bug#13084: boyer_moore crashes with certain characters in the case table
  2012-12-11 15:37 ` Eli Zaretskii
@ 2012-12-11 23:17   ` Juri Linkov
  2012-12-12  3:55     ` Eli Zaretskii
  2012-12-13 13:39   ` Kenichi Handa
  1 sibling, 1 reply; 14+ messages in thread
From: Juri Linkov @ 2012-12-11 23:17 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 13084

> I think i fixed this (revision 111021 on the emacs-24 branch),
> please test.

Thanks, there are no more crashes when using code from
http://debbugs.gnu.org/13041#41

Does this mean there are no more obstacles to filling a translation table
for ignoring equivalence with all character mappings according to the
`decomposition' property?  This would be the first step in this direction.





^ permalink raw reply	[flat|nested] 14+ messages in thread

* bug#13084: boyer_moore crashes with certain characters in the case table
  2012-12-11 23:17   ` Juri Linkov
@ 2012-12-12  3:55     ` Eli Zaretskii
  2012-12-12  9:27       ` Juri Linkov
  0 siblings, 1 reply; 14+ messages in thread
From: Eli Zaretskii @ 2012-12-12  3:55 UTC (permalink / raw)
  To: Juri Linkov; +Cc: 13084

> From: Juri Linkov <juri@jurta.org>
> Cc: Kenichi Handa <handa@gnu.org>,  13084@debbugs.gnu.org
> Date: Wed, 12 Dec 2012 01:17:04 +0200
> 
> > I think i fixed this (revision 111021 on the emacs-24 branch),
> > please test.
> 
> Thanks, there are no more crashes when using code from
> http://debbugs.gnu.org/13041#41
> 
> Does this mean there are no more obstacles to filling a translation table
> for ignoring equivalence with all character mappings according to the
> `decomposition' property?  This would be the first step in this direction.

I'm not sure I understand what you are asking.  Please show more
details.





^ permalink raw reply	[flat|nested] 14+ messages in thread

* bug#13084: boyer_moore crashes with certain characters in the case table
  2012-12-12  3:55     ` Eli Zaretskii
@ 2012-12-12  9:27       ` Juri Linkov
  2012-12-12 10:21         ` martin rudalics
  2012-12-12 16:47         ` Eli Zaretskii
  0 siblings, 2 replies; 14+ messages in thread
From: Juri Linkov @ 2012-12-12  9:27 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 13084

>> Does this mean there are no more obstacles to filling a translation table
>> for ignoring equivalence with all character mappings according to the
>> `decomposition' property?  This would be the first step in this direction.
>
> I'm not sure I understand what you are asking.  Please show more details.

There is confusion with the word `equivalence'.  Currently there
exists the case equivalence table in the case table (`case_eqv_table').
Implementing a diacritic search in bug#13041 requires adding a new
similar table.  I don't know what would be a good name:
`decomposition_eqv_table' or `normalization_eqv_table' or something better.

I'm unfamiliar with the details of `search_buffer', but in principle
using two tables in the macro `TRANSLATE' could implement a diacritic
search where at the first step the character will be translated using
`decomposition_eqv_table', and after that the resulting character
will be translated using `case_eqv_table'.

So the dataflow to get the canonical character will be Á -> A -> a.
If `case-fold-search' is nil, then Á -> A.  If a new variable
`decomposition-search' (or `normalized-search') is nil then Á -> á.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* bug#13084: boyer_moore crashes with certain characters in the case table
  2012-12-12  9:27       ` Juri Linkov
@ 2012-12-12 10:21         ` martin rudalics
  2012-12-12 10:31           ` Juri Linkov
  2012-12-12 16:47         ` Eli Zaretskii
  1 sibling, 1 reply; 14+ messages in thread
From: martin rudalics @ 2012-12-12 10:21 UTC (permalink / raw)
  To: Juri Linkov; +Cc: 13084

 > So the dataflow to get the canonical character will be Á -> A -> a.
 > If `case-fold-search' is nil, then Á -> A.  If a new variable
 > `decomposition-search' (or `normalized-search') is nil then Á -> á.

Any such table should allow handling asymmetric searches: That is,
searching for "ába" should match "ába" "ábà" and "ábá" but not "aba" or
"àbá".  Can we do that?

martin






^ permalink raw reply	[flat|nested] 14+ messages in thread

* bug#13084: boyer_moore crashes with certain characters in the case table
  2012-12-12 10:21         ` martin rudalics
@ 2012-12-12 10:31           ` Juri Linkov
  2012-12-12 12:43             ` martin rudalics
  0 siblings, 1 reply; 14+ messages in thread
From: Juri Linkov @ 2012-12-12 10:31 UTC (permalink / raw)
  To: martin rudalics; +Cc: 13084

>> So the dataflow to get the canonical character will be Á -> A -> a.
>> If `case-fold-search' is nil, then Á -> A.  If a new variable
>> `decomposition-search' (or `normalized-search') is nil then Á -> á.
>
> Any such table should allow handling asymmetric searches: That is,
> searching for "ába" should match "ába" "ábà" and "ábá" but not "aba" or
> "àbá".  Can we do that?

IIUC what you mean is something like `search-upper-case'
where upper case chars disable case fold searching,
so "Aba" should match "Aba" and "AbA" but not "aba".





^ permalink raw reply	[flat|nested] 14+ messages in thread

* bug#13084: boyer_moore crashes with certain characters in the case table
  2012-12-12 10:31           ` Juri Linkov
@ 2012-12-12 12:43             ` martin rudalics
  0 siblings, 0 replies; 14+ messages in thread
From: martin rudalics @ 2012-12-12 12:43 UTC (permalink / raw)
  To: Juri Linkov; +Cc: 13084

> IIUC what you mean is something like `search-upper-case'
> where upper case chars disable case fold searching,
> so "Aba" should match "Aba" and "AbA" but not "aba".

Yes. I think that's a very good explanation in Emacs terms.

martin





^ permalink raw reply	[flat|nested] 14+ messages in thread

* bug#13084: boyer_moore crashes with certain characters in the case table
  2012-12-12  9:27       ` Juri Linkov
  2012-12-12 10:21         ` martin rudalics
@ 2012-12-12 16:47         ` Eli Zaretskii
  2012-12-12 23:05           ` Juri Linkov
  1 sibling, 1 reply; 14+ messages in thread
From: Eli Zaretskii @ 2012-12-12 16:47 UTC (permalink / raw)
  To: Juri Linkov; +Cc: 13084

> From: Juri Linkov <juri@jurta.org>
> Cc: handa@gnu.org,  13084@debbugs.gnu.org
> Date: Wed, 12 Dec 2012 11:27:50 +0200
> 
> >> Does this mean there are no more obstacles to filling a translation table
> >> for ignoring equivalence with all character mappings according to the
> >> `decomposition' property?  This would be the first step in this direction.
> >
> > I'm not sure I understand what you are asking.  Please show more details.
> 
> There is confusion with the word `equivalence'.  Currently there
> exists the case equivalence table in the case table (`case_eqv_table').
> Implementing a diacritic search in bug#13041 requires adding a new
> similar table.  I don't know what would be a good name:
> `decomposition_eqv_table' or `normalization_eqv_table' or something better.
> 
> I'm unfamiliar with the details of `search_buffer', but in principle
> using two tables in the macro `TRANSLATE' could implement a diacritic
> search where at the first step the character will be translated using
> `decomposition_eqv_table', and after that the resulting character
> will be translated using `case_eqv_table'.
> 
> So the dataflow to get the canonical character will be Á -> A -> a.
> If `case-fold-search' is nil, then Á -> A.  If a new variable
> `decomposition-search' (or `normalized-search') is nil then Á -> á.

OK, all this is now clear and agreed.  So what did you mean by "no
more obstacles" above?  The obstacles I see is that case tables aren't
up to the job because they don't support ignoring of characters, and
the code in search.c cannot handle ignoring even if the table did
support that.  These obstacles still stand.






^ permalink raw reply	[flat|nested] 14+ messages in thread

* bug#13084: boyer_moore crashes with certain characters in the case table
  2012-12-12 16:47         ` Eli Zaretskii
@ 2012-12-12 23:05           ` Juri Linkov
  0 siblings, 0 replies; 14+ messages in thread
From: Juri Linkov @ 2012-12-12 23:05 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 13084-done

> So what did you mean by "no more obstacles" above?

By obstacles I meant crashes that you fixed.
Thanks for that.  I'm closing this bug.





^ permalink raw reply	[flat|nested] 14+ messages in thread

* bug#13084: boyer_moore crashes with certain characters in the case table
  2012-12-11 15:37 ` Eli Zaretskii
  2012-12-11 23:17   ` Juri Linkov
@ 2012-12-13 13:39   ` Kenichi Handa
  2012-12-13 17:32     ` Eli Zaretskii
  1 sibling, 1 reply; 14+ messages in thread
From: Kenichi Handa @ 2012-12-13 13:39 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 13084

In article <831uewa9cq.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

> In addition, I'd suggest that Handa-san (or someone else) takes a good
> look at the code that sets up the simple_translate table in
> boyer_moore, because the constants there, like 0200 and 0x3F, and all
> the talk about characters that belong "to the same charset and row"
> smell of pre-Unicode (a.k.a. "MULE") representation of characters.
> For now, I disabled boyer_moore for unibyte characters beyond 160,
> because my reading of the code is that simple_translate and the
> supporting code cannot handle that.  Maybe I'm wrong.

I have not yet checked the code, but what I remember is that
search_buffer checks the search string and decides which to
use; boyer_moore or simple_search.  If all equivalent
characters of all non-ASCII characters in the search string
are in the same character group, we can use boyer_moore.
Here, A and B belongs to the same character group iff A and
B has the same multibyte sequence except for the last byte.
In this condition, we should be able to use the table
simple_translate.

---
Kenichi Handa
handa@gnu.org

^ permalink raw reply	[flat|nested] 14+ messages in thread

* bug#13084: boyer_moore crashes with certain characters in the case table
  2012-12-13 13:39   ` Kenichi Handa
@ 2012-12-13 17:32     ` Eli Zaretskii
  2012-12-15 13:17       ` Kenichi Handa
  0 siblings, 1 reply; 14+ messages in thread
From: Eli Zaretskii @ 2012-12-13 17:32 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: 13084

> From: Kenichi Handa <handa@gnu.org>
> Cc: juri@jurta.org, 13084@debbugs.gnu.org
> Date: Thu, 13 Dec 2012 22:39:29 +0900
> 
> I have not yet checked the code, but what I remember is that
> search_buffer checks the search string and decides which to
> use; boyer_moore or simple_search.  If all equivalent
> characters of all non-ASCII characters in the search string
> are in the same character group, we can use boyer_moore.

Yes, that's my reading of the code as well.

> Here, A and B belongs to the same character group iff A and
> B has the same multibyte sequence except for the last byte.
> In this condition, we should be able to use the table
> simple_translate.

OK, then maybe just the comments need to be fixed.  They shouldn't
talk about "charset" and "row", which are undefined in Unicode Emacs.
They should instead use terminology that correspond to UTF-8 multibyte
representation of characters we use today.





^ permalink raw reply	[flat|nested] 14+ messages in thread

* bug#13084: boyer_moore crashes with certain characters in the case table
  2012-12-13 17:32     ` Eli Zaretskii
@ 2012-12-15 13:17       ` Kenichi Handa
  2012-12-15 13:55         ` Eli Zaretskii
  0 siblings, 1 reply; 14+ messages in thread
From: Kenichi Handa @ 2012-12-15 13:17 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 13084

In article <83obhxoo2v.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

> > Here, A and B belongs to the same character group iff A and
> > B has the same multibyte sequence except for the last byte.
> > In this condition, we should be able to use the table
> > simple_translate.

> OK, then maybe just the comments need to be fixed.  They shouldn't
> talk about "charset" and "row", which are undefined in Unicode Emacs.
> They should instead use terminology that correspond to UTF-8 multibyte
> representation of characters we use today.

I've just committed this change.  How is it?

=== modified file 'src/search.c'
--- src/search.c	2012-10-10 20:09:47 +0000
+++ src/search.c	2012-12-15 13:04:46 +0000
@@ -1313,8 +1313,11 @@
 	     non-nil, we can use boyer-moore search only if TRT can be
 	     represented by the byte array of 256 elements.  For that,
 	     all non-ASCII case-equivalents of all case-sensitive
-	     characters in STRING must belong to the same charset and
-	     row.  */
+	     characters in STRING must belong to the same character
+	     group (two characters belong to the same group iff their
+	     multibyte forms are the same except for the last byte;
+	     i.e. every 64 characters form a group; U+0000..U+003F,
+	     U+0040..U+007F, U+0080..U+00BF, ...).  */
 
 	  while (--len >= 0)
 	    {

---
Kenichi Handa
handa@gnu.org





^ permalink raw reply	[flat|nested] 14+ messages in thread

* bug#13084: boyer_moore crashes with certain characters in the case table
  2012-12-15 13:17       ` Kenichi Handa
@ 2012-12-15 13:55         ` Eli Zaretskii
  0 siblings, 0 replies; 14+ messages in thread
From: Eli Zaretskii @ 2012-12-15 13:55 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: 13084

> From: Kenichi Handa <handa@gnu.org>
> Cc: juri@jurta.org, 13084@debbugs.gnu.org
> Date: Sat, 15 Dec 2012 22:17:17 +0900
> 
> In article <83obhxoo2v.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:
> 
> > > Here, A and B belongs to the same character group iff A and
> > > B has the same multibyte sequence except for the last byte.
> > > In this condition, we should be able to use the table
> > > simple_translate.
> 
> > OK, then maybe just the comments need to be fixed.  They shouldn't
> > talk about "charset" and "row", which are undefined in Unicode Emacs.
> > They should instead use terminology that correspond to UTF-8 multibyte
> > representation of characters we use today.
> 
> I've just committed this change.  How is it?

Clear, thanks.





^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2012-12-15 13:55 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-12-05  0:34 bug#13084: boyer_moore crashes with certain characters in the case table Juri Linkov
2012-12-11 15:37 ` Eli Zaretskii
2012-12-11 23:17   ` Juri Linkov
2012-12-12  3:55     ` Eli Zaretskii
2012-12-12  9:27       ` Juri Linkov
2012-12-12 10:21         ` martin rudalics
2012-12-12 10:31           ` Juri Linkov
2012-12-12 12:43             ` martin rudalics
2012-12-12 16:47         ` Eli Zaretskii
2012-12-12 23:05           ` Juri Linkov
2012-12-13 13:39   ` Kenichi Handa
2012-12-13 17:32     ` Eli Zaretskii
2012-12-15 13:17       ` Kenichi Handa
2012-12-15 13:55         ` Eli Zaretskii

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).