* Problem with Boyer Moore and Greek characters
From: Thomas Morgan @ 2002-04-22 23:44 UTC
I ran GNU Emacs 21.1.1 (i686-pc-linux-gnu, X toolkit) with the options
`--q --no-site-file', then typed the following into `*scratch*':
(search-forward "ί")
ύ
(The first Greek character is an accented iota represented in Emacs by
the character number 342199, and the second is an accented upsilon
represented by 342203. I entered them with the input method
`greek-ibycus4'.)
Then I pressed `C-p' and `C-e' to move point to the end of the first
line, and `C-x C-e' to evaluate the expression.
Here is the exact input for all of that:
( s e a r c h - f o r w a r d SPC " C-x <return> C-\
g r e e k - i b y c u s 4 <return> i ' C-\ " ) <return>
C-\ u ' C-\ C-p C-e C-x C-e
This moved the cursor to the end of the second line, and displayed
`214', the new position of point, in the echo area. So searching for
the iota found the upsilon. This must be a bug.
Boyer-Moore searching compares only the last bytes of the characters,
and this leads to the problem. With case folding, the search also
accepts the capital form of the accented iota, whose last byte is the
same as the last byte of the upsilon, although their second-to-last
bytes are different.
    Capital accented iota     \234\364\362\273
    Small accented upsilon    \234\364\361\273
So before doing a Boyer Moore search, `search_buffer' needs to check
that the character and its inversion have the same first three bytes.
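
The byte comparison described above can be sketched in C. This is only an illustration built from the two byte sequences quoted in the report; the helper names `same_last_byte' and `same_leading_bytes' are hypothetical and do not appear in search.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte sequences exactly as quoted in the report (octal, 4 bytes each). */
static const unsigned char capital_iota[4]  = { 0234, 0364, 0362, 0273 };
static const unsigned char small_upsilon[4] = { 0234, 0364, 0361, 0273 };

/* Hypothetical helper: nonzero when both sequences end in the same byte.
   A comparison that looks at this byte alone cannot tell them apart.  */
static int
same_last_byte (const unsigned char *a, const unsigned char *b, size_t len)
{
  assert (len > 0);
  return a[len - 1] == b[len - 1];
}

/* Hypothetical helper: nonzero when every byte except the last agrees.
   This is the extra condition the report says must hold.  */
static int
same_leading_bytes (const unsigned char *a, const unsigned char *b, size_t len)
{
  assert (len > 0);
  return memcmp (a, b, len - 1) == 0;
}
```

For the two sequences above, `same_last_byte' is true while `same_leading_bytes' is false, because the third bytes (\362 and \361) differ.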
Here is the patch I made to do that. Please forgive my mistakes; I am
not a programmer.
cd ~/emacs-21.1/src/
diff -c /home/tlm/emacs-21.1/src/search.c.\~1\~ /home/tlm/emacs-21.1/src/search.c
*** /home/tlm/emacs-21.1/src/search.c.~1~ Mon Oct 1 02:08:20 2001
--- /home/tlm/emacs-21.1/src/search.c Wed Apr 3 07:53:39 2002
***************
*** 1237,1243 ****
/* Keep track of which character set row
contains the characters that need translation. */
int charset_base_code = c & ~CHAR_FIELD3_MASK;
! if (charset_base == -1)
charset_base = charset_base_code;
else if (charset_base != charset_base_code)
/* If two different rows appear, needing translation,
--- 1237,1246 ----
/* Keep track of which character set row
contains the characters that need translation. */
int charset_base_code = c & ~CHAR_FIELD3_MASK;
! int inverse_charset_base = inverse & ~CHAR_FIELD3_MASK;
! if (charset_base_code != inverse_charset_base)
! boyer_moore_ok = 0;
! else if (charset_base == -1)
charset_base = charset_base_code;
else if (charset_base != charset_base_code)
/* If two different rows appear, needing translation,
Diff finished at Wed Apr 3 08:00:10
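
The check added by the patch can be tried outside Emacs with a small C sketch. Two things here are assumptions and are not stated in the thread: the mask value 0x7F standing in for CHAR_FIELD3_MASK, and the code 342331 for the capital accented iota, which is derived from the quoted byte sequences by assuming the third byte occupies the seven bits above the last one. The names `SKETCH_FIELD3_MASK', `row_base' and `boyer_moore_allowed' are hypothetical.

```c
#include <assert.h>

/* Assumption: the low 7 bits select the position within a row.  This
   stands in for CHAR_FIELD3_MASK, whose real definition lives in the
   Emacs sources and is not shown in this thread.  */
#define SKETCH_FIELD3_MASK 0x7F

/* Clear the position bits, keeping the part of the code that
   identifies the character set row.  */
static int
row_base (int c)
{
  assert (c >= 0);
  return c & ~SKETCH_FIELD3_MASK;
}

/* Mirrors the test the patch adds: allow the fast search only when a
   character and its inverse translation share the same row.  */
static int
boyer_moore_allowed (int c, int inverse)
{
  return row_base (c) == row_base (inverse);
}
```

Under these assumptions, 342199 (small accented iota) and 342203 (small accented upsilon) share a row, while 342199 and 342331 do not, so `boyer_moore_allowed (342199, 342331)' returns 0 and the search would fall back to the slower method.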
* Re: Problem with Boyer Moore and Greek characters
From: Kenichi Handa @ 2002-05-07 13:35 UTC
Cc: bug-gnu-emacs
Sorry for the late reply on this matter.
Although I don't understand this part of code fully, it
seems that your fix is correct. Richard, what do you think?
Shall I install it (both in HEAD and RC)?
---
Ken'ichi HANDA
handa@etl.go.jp
Thomas Morgan <tlm@pocketmail.com> writes:
> I ran GNU Emacs 21.1.1 (i686-pc-linux-gnu, X toolkit) with the options
> `--q --no-site-file', then typed the following into `*scratch*':
> (search-forward "ί")
> ύ
> (The first Greek character is an accented iota represented in Emacs by
> the character number 342199, and the second is an accented upsilon
> represented by 342203. I entered them with the input method
> `greek-ibycus4'.)
> Then I pressed `C-p' and `C-e' to move point to the end of the first
> line, and `C-x C-e' to evaluate the expression.
> Here is the exact input for all of that:
> ( s e a r c h - f o r w a r d SPC " C-x <return> C-\
> g r e e k - i b y c u s 4 <return> i ' C-\ " ) <return>
> C-\ u ' C-\ C-p C-e C-x C-e
> This moved the cursor to the end of the second line, and displayed
> `214', the new position of point, in the echo area. So searching for
> the iota found the upsilon. This must be a bug.
> Boyer Moore searching compares only the last bytes of the characters,
> and this leads to the problem. If you capitalize the accented iota,
> the last byte is the same as the last byte of the upsilon, although
> their second-to-last bytes are different.
> Capital accented iota \234\364\362\273
> Small accented upsilon \234\364\361\273
> So before doing a Boyer Moore search, `search_buffer' needs to check
> that the character and its inversion have the same first three bytes.
> Here is the patch I made to do that. Please forgive my mistakes; I am
> not a programmer.
> cd ~/emacs-21.1/src/
> diff -c /home/tlm/emacs-21.1/src/search.c.\~1\~ /home/tlm/emacs-21.1/src/search.c
> *** /home/tlm/emacs-21.1/src/search.c.~1~ Mon Oct 1 02:08:20 2001
> --- /home/tlm/emacs-21.1/src/search.c Wed Apr 3 07:53:39 2002
> ***************
> *** 1237,1243 ****
> /* Keep track of which character set row
> contains the characters that need translation. */
> int charset_base_code = c & ~CHAR_FIELD3_MASK;
> ! if (charset_base == -1)
> charset_base = charset_base_code;
> else if (charset_base != charset_base_code)
> /* If two different rows appear, needing translation,
> --- 1237,1246 ----
> /* Keep track of which character set row
> contains the characters that need translation. */
> int charset_base_code = c & ~CHAR_FIELD3_MASK;
> ! int inverse_charset_base = inverse & ~CHAR_FIELD3_MASK;
> ! if (charset_base_code != inverse_charset_base)
> ! boyer_moore_ok = 0;
> ! else if (charset_base == -1)
> charset_base = charset_base_code;
> else if (charset_base != charset_base_code)
> /* If two different rows appear, needing translation,
> Diff finished at Wed Apr 3 08:00:10
* Re: Problem with Boyer Moore and Greek characters
From: Richard Stallman @ 2002-05-12 16:44 UTC
Cc: tlm, bug-gnu-emacs
    Although I don't understand this part of code fully, it
    seems that your fix is correct. Richard, what do you think?
    Shall I install it (both in HEAD and RC)?

It seems correct to me. Please do install it in HEAD and RC.
Thanks.
* Re: Problem with Boyer Moore and Greek characters
From: Kenichi Handa @ 2002-05-13 0:12 UTC
Cc: tlm, bug-gnu-emacs
Richard Stallman <rms@gnu.org> writes:
> Although I don't understand this part of code fully, it
> seems that your fix is correct. Richard, what do you think?
> Shall I install it (both in HEAD and RC)?
> It seems correct to me. Please do install it in HEAD and RC.
Done.
As I don't know if Mr. Morgan sent an assignment paper to
FSF, and the change is very small, I wrote your name in
ChangeLog as below.

2002-05-13  Richard M. Stallman  <rms@gnu.org>

	* search.c (search_buffer): Give up boyer moore search if inverse
	translation changes charset_base.

And, for RC, as there's no such branch/tag as "RC", I used
EMACS_21_1_RC branch. Is it ok?
---
Ken'ichi HANDA
handa@etl.go.jp
* Re: Problem with Boyer Moore and Greek characters
From: Richard Stallman @ 2002-05-13 17:00 UTC
Cc: tlm, bug-gnu-emacs
    As I don't know if Mr. Morgan sent an assignment paper to
    FSF, and the change is very small, I wrote your name in
    ChangeLog as below.

This change is so small we don't need legal papers for it.
Thanks.