unofficial mirror of bug-gnu-emacs@gnu.org 
* Problem with Boyer Moore and Greek characters
From: Thomas Morgan @ 2002-04-22 23:44 UTC (permalink / raw)


I ran GNU Emacs 21.1.1 (i686-pc-linux-gnu, X toolkit) with the options
`--q --no-site-file', then typed the following into `*scratch*':

  (search-forward "ί")
  ύ

(The first Greek character is an accented iota represented in Emacs by
the character number 342199, and the second is an accented upsilon
represented by 342203.  I entered them with the input method
`greek-ibycus4'.)

Then I pressed `C-p' and `C-e' to move point to the end of the first
line, and `C-x C-e' to evaluate the expression.

Here is the exact input for all of that:

( s e a r c h - f o r w a r d SPC " C-x <return> C-\ 
g r e e k - i b y c u s 4 <return> i ' C-\ " ) <return> 
C-\ u ' C-\ C-p C-e C-x C-e

This moved the cursor to the end of the second line, and displayed
`214', the new position of point, in the echo area.  So searching for
the iota found the upsilon.  This must be a bug.

Boyer Moore searching compares only the last bytes of the characters,
and this leads to the problem.  Because the search ignores case, the
capital accented iota is also treated as a match for the pattern.  Its
last byte is the same as the last byte of the upsilon, although their
second-to-last bytes are different.

Capital accented iota	\234\364\362\273
Small accented upsilon	\234\364\361\273
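
To make the byte comparison above concrete, here is a minimal C sketch
that rebuilds character numbers from the four-byte forms listed.  The
field layout (a 5-bit charset index, a 7-bit row and a 7-bit column)
and the helper name `char_from_bytes' are assumptions for
illustration, not code from search.c.  Under that layout the upsilon's
bytes decode to 342203, the number quoted earlier, and the two
characters differ only in the row field.

```c
#include <assert.h>

/* The four-byte internal forms quoted in the report (octal escapes).  */
const unsigned char capital_iota[4]  = { 0234, 0364, 0362, 0273 };
const unsigned char small_upsilon[4] = { 0234, 0364, 0361, 0273 };

/* Hypothetical helper: rebuild a character number from the bytes
   LEADING CHARSET ROW COLUMN.  The bit layout is an assumption:
   (charset - 0xE0) in the top field, then 7 bits of row and 7 bits
   of column.  */
int
char_from_bytes (const unsigned char b[4])
{
  /* \234 is the only leading byte this sketch knows how to handle.  */
  assert (b[0] == 0234);
  return ((b[1] - 0xE0) << 14) | ((b[2] & 0x7F) << 7) | (b[3] & 0x7F);
}

/* Returns 1 when only the last bytes are compared and they agree,
   which is the comparison the report describes.  */
int
last_bytes_match (const unsigned char a[4], const unsigned char b[4])
{
  return a[3] == b[3];
}
```

Here last_bytes_match (capital_iota, small_upsilon) is true even
though char_from_bytes gives different numbers for the two characters,
which is the collision described above.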

So before doing a Boyer Moore search, `search_buffer' needs to check
that the character and its inverse translation (here, the same letter
in the other case) have the same first three bytes.
Here is the patch I made to do that.  Please forgive my mistakes; I am
not a programmer.

cd ~/emacs-21.1/src/
diff -c /home/tlm/emacs-21.1/src/search.c.\~1\~ /home/tlm/emacs-21.1/src/search.c
*** /home/tlm/emacs-21.1/src/search.c.~1~	Mon Oct  1 02:08:20 2001
--- /home/tlm/emacs-21.1/src/search.c	Wed Apr  3 07:53:39 2002
***************
*** 1237,1243 ****
  		  /* Keep track of which character set row
  		     contains the characters that need translation.  */
  		  int charset_base_code = c & ~CHAR_FIELD3_MASK;
! 		  if (charset_base == -1)
  		    charset_base = charset_base_code;
  		  else if (charset_base != charset_base_code)
  		    /* If two different rows appear, needing translation,
--- 1237,1246 ----
  		  /* Keep track of which character set row
  		     contains the characters that need translation.  */
  		  int charset_base_code = c & ~CHAR_FIELD3_MASK;
! 		  int inverse_charset_base = inverse & ~CHAR_FIELD3_MASK;
! 		  if (charset_base_code != inverse_charset_base)
! 		    boyer_moore_ok = 0;
! 		  else if (charset_base == -1)
  		    charset_base = charset_base_code;
  		  else if (charset_base != charset_base_code)
  		    /* If two different rows appear, needing translation,

Diff finished at Wed Apr  3 08:00:10
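
The effect of the patch can be sketched as a standalone function.
This is a simplified model, not the real `search_buffer': it assumes
every pattern character needs translation, that each has a single
inverse, and that the branch truncated in the diff also disables Boyer
Moore.  `FIELD3_MASK' stands in for `CHAR_FIELD3_MASK' with an assumed
value of 0x7F, and `boyer_moore_ok_for' is a hypothetical name.

```c
#include <assert.h>

/* Assumed value of CHAR_FIELD3_MASK: the lowest 7 bits of a
   character number.  */
#define FIELD3_MASK 0x7F

/* Hypothetical model of the patched test.  CHARS holds the pattern
   characters and INVERSES their inverse translations, N entries each.
   Returns 1 if Boyer Moore may be used, 0 otherwise.  */
int
boyer_moore_ok_for (const int *chars, const int *inverses, int n)
{
  int charset_base = -1;
  int i;

  assert (n >= 0);
  for (i = 0; i < n; i++)
    {
      int charset_base_code = chars[i] & ~FIELD3_MASK;
      int inverse_charset_base = inverses[i] & ~FIELD3_MASK;

      if (charset_base_code != inverse_charset_base)
        /* New test from the patch: the character and its inverse
           lie in different rows.  */
        return 0;
      else if (charset_base == -1)
        charset_base = charset_base_code;
      else if (charset_base != charset_base_code)
        /* Existing test: two different rows appear (assumed to
           disable Boyer Moore, as the truncated comment suggests).  */
        return 0;
    }
  return 1;
}
```

For the small accented iota (342199) paired with its capital (342331
under the assumed layout) the function returns 0, so the search would
presumably fall back to the simple method instead of matching the
upsilon.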


* Re: Problem with Boyer Moore and Greek characters
From: Kenichi Handa @ 2002-05-07 13:35 UTC (permalink / raw)
  Cc: bug-gnu-emacs

Sorry for the late reply on this matter.

Although I don't understand this part of code fully, it
seems that your fix is correct.  Richard, what do you think?
Shall I install it (both in HEAD and RC)?

---
Ken'ichi HANDA
handa@etl.go.jp

Thomas Morgan <tlm@pocketmail.com> writes:
> [...]


* Re: Problem with Boyer Moore and Greek characters
From: Richard Stallman @ 2002-05-12 16:44 UTC (permalink / raw)
  Cc: tlm, bug-gnu-emacs

    Although I don't understand this part of code fully, it
    seems that your fix is correct.  Richard, what do you think?
    Shall I install it (both in HEAD and RC)?

It seems correct to me.  Please do install it in HEAD and RC.

Thanks.


* Re: Problem with Boyer Moore and Greek characters
From: Kenichi Handa @ 2002-05-13  0:12 UTC (permalink / raw)
  Cc: tlm, bug-gnu-emacs

Richard Stallman <rms@gnu.org> writes:
>     Although I don't understand this part of code fully, it
>     seems that your fix is correct.  Richard, what do you think?
>     Shall I install it (both in HEAD and RC)?

> It seems correct to me.  Please do install it in HEAD and RC.

Done.

As I don't know if Mr. Morgan sent an assignment paper to
FSF, and the change is very small, I wrote your name in
ChangeLog as below.

2002-05-13  Richard M. Stallman  <rms@gnu.org>

        * search.c (search_buffer): Give up boyer moore search if inverse
        translation changes charset_base.

And, for RC, as there's no such branch/tag as "RC", I used the
EMACS_21_1_RC branch.  Is that OK?

---
Ken'ichi HANDA
handa@etl.go.jp


* Re: Problem with Boyer Moore and Greek characters
From: Richard Stallman @ 2002-05-13 17:00 UTC (permalink / raw)
  Cc: tlm, bug-gnu-emacs

    As I don't know if Mr. Morgan sent an assignment paper to
    FSF, and the change is very small, I wrote your name in
    ChangeLog as below.

This change is so small we don't need legal papers for it.
Thanks.
