From: Thomas Morgan
Newsgroups: gmane.emacs.bugs
Subject: Problem with Boyer Moore and Greek characters
Date: 22 Apr 2002 19:44:17 -0400

I ran GNU Emacs 21.1.1 (i686-pc-linux-gnu, X toolkit) with the options
`--q --no-site-file', then typed the following into `*scratch*':

(search-forward "ί")
ύ

(The first Greek character is an accented iota, represented in Emacs by
the character number 342199, and the second is an accented upsilon,
represented by 342203.  I entered them with the input method
`greek-ibycus4'.)

Then I pressed `C-p' and `C-e' to move point to the end of the first
line, and `C-x C-e' to evaluate the expression.  Here is the exact
input for all of that:

( s e a r c h - f o r w a r d SPC " C-x C-\
g r e e k - i b y c u s 4 i ' C-\ " )
C-\ u ' C-\ C-p C-e C-x C-e

This moved the cursor to the end of the second line and displayed
`214', the new position of point, in the echo area.  So searching for
the iota found the upsilon.  This must be a bug.

Boyer-Moore searching compares only the last bytes of the characters,
and this leads to the problem.  Since the search ignores case by
default, it also looks for the capital form of the iota.  If you
capitalize the accented iota, its last byte is the same as the last
byte of the upsilon, although their second-to-last bytes are different.

Capital accented iota   \234\364\362\273
Small accented upsilon  \234\364\361\273

So before doing a Boyer-Moore search, `search_buffer' needs to check
that the character and its inverse translation have the same first
three bytes.  Here is the patch I made to do that.  Please forgive my
mistakes; I am not a programmer.

cd ~/emacs-21.1/src/
diff -c /home/tlm/emacs-21.1/src/search.c.\~1\~ /home/tlm/emacs-21.1/src/search.c
*** /home/tlm/emacs-21.1/src/search.c.~1~	Mon Oct  1 02:08:20 2001
--- /home/tlm/emacs-21.1/src/search.c	Wed Apr  3 07:53:39 2002
***************
*** 1237,1243 ****
    /* Keep track of which character set row
       contains the characters that need translation.  */
    int charset_base_code = c & ~CHAR_FIELD3_MASK;
!   if (charset_base == -1)
      charset_base = charset_base_code;
    else if (charset_base != charset_base_code)
      /* If two different rows appear, needing translation,
--- 1237,1246 ----
    /* Keep track of which character set row
       contains the characters that need translation.
       */
    int charset_base_code = c & ~CHAR_FIELD3_MASK;
!   int inverse_charset_base = inverse & ~CHAR_FIELD3_MASK;
!   if (charset_base_code != inverse_charset_base)
!     boyer_moore_ok = 0;
!   else if (charset_base == -1)
      charset_base = charset_base_code;
    else if (charset_base != charset_base_code)
      /* If two different rows appear, needing translation,

Diff finished at Wed Apr  3 08:00:10
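
To make the arithmetic behind the patch concrete, here is a small
standalone sketch of the same check, outside of Emacs.  It is only an
illustration under stated assumptions: FIELD3_MASK is assumed to equal
CHAR_FIELD3_MASK (the low 7 bits of a character code), the helper names
are hypothetical and do not exist in search.c, and the code 342331 for
the capital accented iota is not given in the report but derived from
the byte sequences quoted above (same last field as the upsilon, next
row up).

```c
#include <assert.h>

/* Assumption: same value as CHAR_FIELD3_MASK in Emacs 21's charset.h,
   i.e. the low 7 bits of a character code.  */
#define FIELD3_MASK 0x7F

/* Hypothetical helper: the part of a character code that a search
   comparing only the last byte actually looks at.  */
static int
last_field (int c)
{
  return c & FIELD3_MASK;
}

/* Hypothetical helper: nonzero when C and its case translation INVERSE
   lie in the same row, i.e. differ at most in the last field.  This is
   the condition the patch tests before leaving Boyer-Moore enabled.  */
static int
same_row (int c, int inverse)
{
  return (c & ~FIELD3_MASK) == (inverse & ~FIELD3_MASK);
}

/* Runs the check on the three characters discussed in the report.  */
static void
check_reported_characters (void)
{
  int small_iota = 342199;     /* given in the report */
  int small_upsilon = 342203;  /* given in the report */
  int capital_iota = 342331;   /* assumption: derived from the quoted bytes */

  /* Capital iota and small upsilon cannot be told apart by last field.  */
  assert (last_field (capital_iota) == last_field (small_upsilon));

  /* Small iota and its capital lie in different rows, so the patched
     code would set boyer_moore_ok to 0 for this pattern.  */
  assert (!same_row (small_iota, capital_iota));

  /* Small iota and small upsilon share a row but differ in the last
     field, so a last-byte comparison does distinguish them.  */
  assert (same_row (small_iota, small_upsilon));
  assert (last_field (small_iota) != last_field (small_upsilon));
}
```

Calling check_reported_characters () from a test program should pass
all four assertions if the assumptions above hold.  The point of the
sketch is that the false match comes from the capital iota, not the
small one, which is why the patch compares the row of the character
with the row of its translation.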