Problem with Boyer Moore and Greek characters

From: Thomas Morgan @ 2002-04-22 23:44 UTC

I ran GNU Emacs 21.1.1 (i686-pc-linux-gnu, X toolkit) with the options
`--q --no-site-file', then typed the following into `*scratch*':

  (search-forward "ί")
  ύ

(The first Greek character is an accented iota represented in Emacs by
the character number 342199, and the second is an accented upsilon
represented by 342203.  I entered them with the input method
`greek-ibycus4'.)

Then I pressed `C-p' and `C-e' to move point to the end of the first
line, and `C-x C-e' to evaluate the expression.

Here is the exact input for all of that:

( s e a r c h - f o r w a r d SPC " C-x <return> C-\ 
g r e e k - i b y c u s 4 <return> i ' C-\ " ) <return> 
C-\ u ' C-\ C-p C-e C-x C-e

This moved the cursor to the end of the second line, and displayed
`214', the new position of point, in the echo area.  So searching for
the iota found the upsilon.  This must be a bug.

Boyer Moore searching compares only the last bytes of the characters,
and this leads to the problem.  Because the search ignores case, the
capital form of the accented iota is also a candidate match.  The last
byte of that capital iota is the same as the last byte of the upsilon,
although their second-to-last bytes are different.

Capital accented iota	\234\364\362\273
Small accented upsilon	\234\364\361\273
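
To make the collision concrete, here is a minimal sketch in C that
compares the two byte sequences quoted above, first by last byte only
and then in full.  The helper names last_byte_equal and all_bytes_equal
are hypothetical and do not appear in search.c; they only illustrate
why a last-byte test cannot tell these two characters apart.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte sequences as quoted in the message, written as octal constants. */
static const unsigned char capital_iota[]  = { 0234, 0364, 0362, 0273 };
static const unsigned char small_upsilon[] = { 0234, 0364, 0361, 0273 };

/* Hypothetical helper: nonzero if the final bytes of A and B match.
   This is the only comparison a last-byte test performs.  */
static int
last_byte_equal (const unsigned char *a, const unsigned char *b, size_t len)
{
  assert (len > 0);
  return a[len - 1] == b[len - 1];
}

/* Hypothetical helper: nonzero only if every byte of A and B matches.  */
static int
all_bytes_equal (const unsigned char *a, const unsigned char *b, size_t len)
{
  assert (len > 0);
  return memcmp (a, b, len) == 0;
}
```

Called with capital_iota, small_upsilon and a length of 4, the first
helper reports a match and the second does not, because only the third
byte (\362 versus \361) differs.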

So before doing a Boyer Moore search, `search_buffer' needs to check
that the character and its inversion have the same first three bytes.
Here is the patch I made to do that.  Please forgive my mistakes; I am
not a programmer.

cd ~/emacs-21.1/src/
diff -c /home/tlm/emacs-21.1/src/search.c.\~1\~ /home/tlm/emacs-21.1/src/search.c
*** /home/tlm/emacs-21.1/src/search.c.~1~	Mon Oct  1 02:08:20 2001
--- /home/tlm/emacs-21.1/src/search.c	Wed Apr  3 07:53:39 2002
***************
*** 1237,1243 ****
  		  /* Keep track of which character set row
  		     contains the characters that need translation.  */
  		  int charset_base_code = c & ~CHAR_FIELD3_MASK;
! 		  if (charset_base == -1)
  		    charset_base = charset_base_code;
  		  else if (charset_base != charset_base_code)
  		    /* If two different rows appear, needing translation,
--- 1237,1246 ----
  		  /* Keep track of which character set row
  		     contains the characters that need translation.  */
  		  int charset_base_code = c & ~CHAR_FIELD3_MASK;
! 		  int inverse_charset_base = inverse & ~CHAR_FIELD3_MASK;
! 		  if (charset_base_code != inverse_charset_base)
! 		    boyer_moore_ok = 0;
! 		  else if (charset_base == -1)
  		    charset_base = charset_base_code;
  		  else if (charset_base != charset_base_code)
  		    /* If two different rows appear, needing translation,

Diff finished at Wed Apr  3 08:00:10
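
The test that the patch adds can also be tried in isolation.  The
sketch below is not Emacs code: same_row is a hypothetical helper, and
the value 0x7F for CHAR_FIELD3_MASK is an assumption (the low seven
bits selecting a position within a row).  The number 342331 used for
the capital accented iota in the note after the code is likewise an
assumption, derived from the byte sequences above by adding one row
(128) to the upsilon's number; it is not given in the report.

```c
#include <assert.h>

/* Assumed value: the low 7 bits pick the position within a row.  */
#define CHAR_FIELD3_MASK 0x7F

/* Hypothetical helper mirroring the comparison added by the patch:
   nonzero if C and INVERSE lie in the same character set row.  */
static int
same_row (int c, int inverse)
{
  assert (c >= 0 && inverse >= 0);
  return (c & ~CHAR_FIELD3_MASK) == (inverse & ~CHAR_FIELD3_MASK);
}
```

Under these assumptions, same_row (342199, 342203) is true, since the
small iota and small upsilon share a row, while same_row (342199,
342331) is false.  The second case is the one in which the patched
code sets boyer_moore_ok to 0.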
