Same non-ASCII characters not 'equal'


* Same non-ASCII characters not 'equal'
@ 2006-08-13 11:44 Sebastian Tennant
  2006-08-15  1:20 ` James Cloos
  0 siblings, 1 reply; 7+ messages in thread
From: Sebastian Tennant @ 2006-08-13 11:44 UTC (permalink / raw)


Hello all,

I'm trying to write a little vocab tester but I've stumbled upon some
strange behaviour I can't figure out.

For some reason the following code does not match strings containing
special characters (i.e., non-ASCII characters entered using an input
method):

 (with-temp-buffer
   (set-input-method 'turkish-postfix)
   (let ((dict (list '("glass" "bardak") '("house" "ev") '("girl" "kız")
                     '("child" "çocuk") '("little" "küçük") '("good" "iyi")
                     '("bad" "fena") '("horse" "at") '("this" "bu")))
	 (input (read-from-minibuffer "? " nil nil nil nil nil t))
	 match)
     (dolist (each dict (and match (message "Equal")))
       (when (member input each) (setq match t)))))

Take 'child' and 'çocuk' for instance.  Because the (turkish-postfix)
input method is inherited in the minibuffer you have to type 
'c h i 2 l d' to enter 'child' and a match is found, but when you
enter 'çocuk' by typing 'c , o c u k', no match is found.  Could this
be a bug even?
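
[Editor's sketch, not part of the original post.] One way to narrow this
down is to separate the comparison from the minibuffer input, so that
each half can be checked on its own.  The names my-vocab-dict and
my-vocab-match-p are hypothetical, and the word list is only a subset of
the one above:

```elisp
;; Hypothetical helper names; the word list is a subset of the original.
(defvar my-vocab-dict
  '(("child" "çocuk") ("little" "küçük") ("girl" "kız"))
  "Sample list of (ENGLISH TURKISH) pairs.")

(defun my-vocab-match-p (input dict)
  "Return non-nil if INPUT is `equal' to a word in some pair of DICT."
  (let (match)
    (dolist (pair dict match)
      ;; `member' compares with `equal', i.e. character by character.
      (when (member input pair)
        (setq match t)))))
```

If (my-vocab-match-p "çocuk" my-vocab-dict) succeeds with a literal
string but fails when the first argument comes from read-from-minibuffer,
the difference most likely lies in the characters themselves rather than
in the matching logic.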

sebyte


* Re: Same non-ASCII characters not 'equal'
  2006-08-13 11:44 Same non-ASCII characters not 'equal' Sebastian Tennant
@ 2006-08-15  1:20 ` James Cloos
  2006-08-17  7:23   ` Sebastian Tennant
  0 siblings, 1 reply; 7+ messages in thread
From: James Cloos @ 2006-08-15  1:20 UTC (permalink / raw)
  Cc: help-gnu-emacs

>>>>> "Sebastian" == Sebastian Tennant <sebyte@smolny.plus.com> writes:

Sebastian> Take 'child' and 'çocuk' for instance.  Because the (turkish-postfix)
Sebastian> input method is inherited in the minibuffer you have to type 
Sebastian> 'c h i 2 l d' to enter 'child' and a match is found, but when you
Sebastian> enter 'çocuk' by typing 'c , o c u k', no match is found.  Could this
Sebastian> be a bug even?

Emacs versions other than the emacs-unicode-2 branch store each of the
iso-8859-x glyphsets separately.  You are probably ending up with the
8859-1 (Latin 1) version of U+00E7 LATIN SMALL LETTER C WITH CEDILLA
in the elisp; using the turkish-postfix input method most likely uses
8859-9 (Latin 5).  
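
[Editor's sketch, not part of the original message.] To see which
character set Emacs associates with each non-ASCII character in a
string, something along these lines may help.  The function name
my-string-charsets is hypothetical.  It relies only on the built-in
char-charset; on Emacs versions with a unified internal representation
the result mostly reflects charset priority, so the output is likely to
be most informative on Emacs 21/22:

```elisp
(defun my-string-charsets (string)
  "Return a list of (CHAR . CHARSET) pairs for the non-ASCII chars of STRING."
  (let (result)
    (dolist (ch (string-to-list string) (nreverse result))
      ;; Skip plain ASCII; only the other characters are of interest here.
      (when (> ch 127)
        (push (cons ch (char-charset ch)) result)))))
```

Evaluating this on both the dictionary entry and the string typed in the
minibuffer, and comparing the two lists, would show whether the two ç
characters really are the same character internally.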

One way to make latin1’s ç and latin5’s ç match is to use one or both
of unify-8859-on-decoding-mode and/or unify-8859-on-encoding-mode.

Or, make sure you use the same encoding to enter the elisp that your
users will use.  There are commands to convert the current buffer to
a different encoding.  

Since I’ve moved almost exclusively to the unicode-2 branch, I don’t
remember the specifics of the unify-8859 modes, but they are documented
in info.

-JimC  (who has been caught by this issue before)
-- 
James Cloos <cloos@jhcloos.com>


* Re: Same non-ASCII characters not 'equal'
  2006-08-15  1:20 ` James Cloos
@ 2006-08-17  7:23   ` Sebastian Tennant
  2006-08-17 16:49     ` James Cloos
  0 siblings, 1 reply; 7+ messages in thread
From: Sebastian Tennant @ 2006-08-17  7:23 UTC (permalink / raw)


Quoth James Cloos <cloos@jhcloos.com>:
>>>>>> "Sebastian" == Sebastian Tennant <sebyte@smolny.plus.com> writes:
> Sebastian> Take 'child' and 'çocuk' for instance.  Because the (turkish-postfix)
> Sebastian> input method is inherited in the minibuffer you have to type 
> Sebastian> 'c h i 2 l d' to enter 'child' and a match is found, but when you
> Sebastian> enter 'çocuk' by typing 'c , o c u k', no match is found.  Could this
> Sebastian> be a bug even?
>
> Emacs versions other than the emacs-unicode-2 branch store each of the
> iso-8859-x glyphsets separately.  You are probably ending up with the
> 8859-1 (Latin 1) version of U+00E7 LATIN SMALL LETTER C WITH CEDILLA
> in the elisp; using the turkish-postfix input method most likely uses
> 8859-9 (Latin 5).  

I don't think this is the problem as I'm working with a unicode
terminal, and the encodings used for read and write are mule-utf-8.

> One way to make latin1’s ç and latin5’s ç match is to use one or both
> of unify-8859-on-decoding-mode and/or unify-8859-on-encoding-mode.

I've tried setting these variables in the temporary buffer, without
success.

> Or, make sure you use the same encoding to enter the elisp that your
> users will use.  There are commands to convert the current buffer to
> a different encoding.  

Everything is mule-utf-8.

> Since I’ve moved almost exclusively to the unicode-2 branch, I don’t
> remember the specifics of the unify-8859 modes, but they are documented
> in info.

I'm not sure what you mean by unicode-2 branch.

(emacs-version)
"GNU Emacs 21.4.1 (i486-pc-linux-gnu)
 of 2006-05-15 on trouble, modified by Debian"

> -JimC  (who has been caught by this issue before)

Thanks for your help Jim, but I'm still stuck :-(

I've managed to establish that the problem is caused by either the
read or write to disk, or both.  If the dictionary is defined in the
function, matches are found without a problem.  It's only when the
dictionary is populated from disk that matches of non-ASCII characters
fail.

Can you think of anything else I can try?

Perhaps a few variable checks in the code, to help diagnose the
problem?

Sebastian


* Re: Same non-ASCII characters not 'equal'
  2006-08-17  7:23   ` Sebastian Tennant
@ 2006-08-17 16:49     ` James Cloos
  2006-08-21 12:25       ` Sebastian Tennant
  0 siblings, 1 reply; 7+ messages in thread
From: James Cloos @ 2006-08-17 16:49 UTC (permalink / raw)
  Cc: help-gnu-emacs

>>>>> "Sebastian" == Sebastian Tennant <sebyte@smolny.plus.com>
>>>>> writes:

>> You are probably ending up with the 8859-1 (Latin 1) version of
>> U+00E7 LATIN SMALL LETTER C WITH CEDILLA in the elisp; using the
>> turkish-postfix input method most likely uses 8859-9 (Latin 5).

Sebastian> I don't think this is the problem as I'm working with a
Sebastian> unicode terminal, and the encodings used for read and write
Sebastian> are mule-utf-8

It is still possible that turkish-postfix generates a latin5 ç rather
than a mule-utf-8 ç.  But, no.  I get the same buffer code when using
turkish-postfix as when using X’s “<Multi_key> <,> <c>”.  (That w/o
anything interesting in ~/.emacs but with LANG=en_US.UTF-8 run on a sid
box with emacs-snapshot-nox installed via apt.)

Sebastian> Everything is mule-utf-8.

Then my guess was a red herring.  I don’t know what the problem is.

>> Since I’ve moved almost exclusively to the unicode-2 branch, I
>> don’t remember the specifics of the unify-8859 modes, but they are
>> documented in info.

Sebastian> I'm not sure what you mean by unicode-2 branch

Sebastian> (emacs-version) "GNU Emacs 21.4.1 (i486-pc-linux-gnu)
Sebastian>  of 2006-05-15 on trouble, modified by Debian"

The unicode-2 branch is a branch of the Emacs CVS repository.  You can
grab it from cvs by using:

cvs -d :pserver:cvs.savannah.gnu.org:/cvsroot/emacs co -r emacs-unicode-2 emacs

instead of using:

cvs -d :pserver:cvs.savannah.gnu.org:/cvsroot/emacs co emacs

which grabs the HEAD branch.

The HEAD branch is to be released as Emacs-22.  The emacs-unicode-2
branch is likely to be the basis of the Emacs-23 release.

On debian, you can get a build of snapshots of the HEAD branch by
installing emacs-snapshot, emacs-snapshot-nox or emacs-snapshot-gtk
rather than using emacs, emacs-nox, emacs21 or emacs21-nox.  (On sid
what you are running would be emacs21 or emacs21-nox, as applicable.
What is emacs21 on sid *may* be just emacs on sarge.  I’m not sure
about etch.)  Emacs-snapshot *might* handle this better than emacs21
does.  Or it might not.  I’m confident that the unicode-2 branch,
however, will get it right.  But on debian you’ll have to compile it
yourself.  (On ubuntu, emacs-snapshot is certainly available for edgy
and, I *think*, for dapper; I’ve not tried anything older than that.)

Sebastian> I've managed to establish that the problem is caused by
Sebastian> either the read or write to disk, or both.  If the
Sebastian> dictionary is defined in the function, matches are found
Sebastian> without a problem.  It's only when the dictionary is
Sebastian> populated from disk when matches of non-ASCII characters
Sebastian> fail.

Try running (describe-char) with the point on the offending characters
in the buffer containing the data as read from disk.  If they don’t
match what you get from (describe-char) on the freshly keyboard-input
characters then my guess was on the mark after all.  Or at least in
the same ballpark ☺ — or the same football pitch, if you prefer.
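
[Editor's sketch, not part of the original message.] As a programmatic
complement to inspecting characters one at a time, a small helper can
report the first position at which two strings differ, along with the
two character codes.  The name my-first-char-mismatch is hypothetical:

```elisp
(defun my-first-char-mismatch (a b)
  "Return (INDEX CHAR-A CHAR-B) for the first position where A and B differ.
Return (INDEX nil nil) if one string is a proper prefix of the other,
and nil if the strings contain the same characters."
  (let ((i 0)
        (n (min (length a) (length b)))
        found)
    (while (and (not found) (< i n))
      (if (/= (aref a i) (aref b i))
          (setq found (list i (aref a i) (aref b i)))
        (setq i (1+ i))))
    (or found
        (and (/= (length a) (length b))
             (list n nil nil)))))
```

For example, (my-first-char-mismatch "abc" "abd") returns (2 99 100).
Applied to the word read from disk and the word typed at the keyboard,
two different codes at the position of the ç would point to an encoding
mismatch.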

If that is the case, I presume you need to set the coding-system for
reading in the dictionary data as mule-utf-8.  
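
[Editor's sketch, not part of the original message.] Forcing the coding
system on both the write and the read could look roughly like this.  The
function names my-save-dict and my-load-dict are hypothetical, and the
sketch assumes the dictionary is stored as a single printed Lisp list.
It uses the coding system name utf-8; on Emacs 21 the corresponding name
is mule-utf-8, as mentioned above:

```elisp
(defun my-save-dict (dict file)
  "Write DICT to FILE as a printed Lisp object, encoded as UTF-8."
  ;; `with-temp-file' writes the buffer out when its body finishes,
  ;; so the binding must wrap the whole form.
  (let ((coding-system-for-write 'utf-8))
    (with-temp-file file
      (prin1 dict (current-buffer)))))

(defun my-load-dict (file)
  "Read and return the Lisp object stored in FILE, decoding it as UTF-8."
  (with-temp-buffer
    (let ((coding-system-for-read 'utf-8))
      (insert-file-contents file))
    (goto-char (point-min))
    (read (current-buffer))))
```

Binding coding-system-for-read and coding-system-for-write overrides
whatever Emacs would otherwise guess for the file, which removes one
possible source of mismatch between the stored and the typed characters.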

-JimC
-- 
James Cloos <cloos@jhcloos.com>


* Re: Same non-ASCII characters not 'equal'
  2006-08-17 16:49     ` James Cloos
@ 2006-08-21 12:25       ` Sebastian Tennant
  2006-08-21 13:15         ` James Cloos
  0 siblings, 1 reply; 7+ messages in thread
From: Sebastian Tennant @ 2006-08-21 12:25 UTC (permalink / raw)
  Cc: help-gnu-emacs

Quoth James Cloos <cloos@jhcloos.com>:
> Then my guess was a red herring.  I don’t know what the problem is.
>
> On debian, you can get a build of snapshots of the HEAD branch by
> installing emacs-snapshot, emacs-snapshot-nox or emacs-snapshot-gtk
> rather than using emacs, emacs-nox, emacs21 or emacs21-nox.  (On sid
> what you are running would be emacs21 or emacs21-nox, as applicable.
> What is emacs21 on sid *may* be just emacs on sarge.  I’m not sure
> about etch.)  Emacs-snapshot *might* handle this better than emacs21
> does.  Or it might not.  I’m confident that the unicode-2 branch,
> however, will get it right.  But on debian you’ll have to compile it
> yourself.  (On ubuntu, emacs-snapshot is certainly available for edgy
> and, I *think*, for dapper; I’ve not tried anything older than that.)

I've installed emacs-snapshot-nox from sid on my predominately etch
box... and the problem is solved :-)

Thanks for your assistance Jim.

Sebastian


* Re: Same non-ASCII characters not 'equal'
  2006-08-21 12:25       ` Sebastian Tennant
@ 2006-08-21 13:15         ` James Cloos
  0 siblings, 0 replies; 7+ messages in thread
From: James Cloos @ 2006-08-21 13:15 UTC (permalink / raw)
  Cc: Sebastian Tennant

>>>>> "Sebastian" == Sebastian Tennant <sebyte@smolny.plus.com> writes:

Sebastian> I've installed emacs-snapshot-nox from sid on my
Sebastian> predominately etch box... and the problem is solved :-)

Sebastian> Thanks for your assistance Jim.

Glad I could help!

-JimC
-- 
James Cloos <cloos@jhcloos.com>


[parent not found: <mailman.5138.1155469477.9609.help-gnu-emacs@gnu.org>]

* Re: Same non-ASCII characters not 'equal'
       [not found] <mailman.5138.1155469477.9609.help-gnu-emacs@gnu.org>
@ 2006-08-13 16:56 ` Pascal Bourguignon
  0 siblings, 0 replies; 7+ messages in thread
From: Pascal Bourguignon @ 2006-08-13 16:56 UTC (permalink / raw)


Sebastian Tennant <sebyte@smolny.plus.com> writes:

> Hello all,
>
> I'm trying to write a little vocab tester but I've stumbled upon some
> strange behaviour I can't figure out.
>
> For some reason the following code does not match strings containing
> special characters (i.e., non-ASCII characters input using an input
> method)?
>
>  (with-temp-buffer
>    (set-input-method 'turkish-postfix)
>    (let ((dict (list '("glass" "bardak") '("house" "ev") '("girl" "kız")
>                      '("child" "çocuk") '("little" "küçük") '("good" "iyi")
>                      '("bad" "fena") '("horse" "at") '("this" "bu")))
> 	 (input (read-from-minibuffer "? " nil nil nil nil nil t))
> 	 match)
>      (dolist (each dict (and match (message "Equal")))
>        (when (member input each) (setq match t)))))
>
> Take 'child' and 'çocuk' for instance.  Because the (turkish-postfix)
> input method is inherited in the minibuffer you have to type 
> 'c h i 2 l d' to enter 'child' and a match is found, but when you
> enter 'çocuk' by typing 'c , o c u k', no match is found.  Could this
> be a bug even?

It works for me.

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

WARNING: This product warps space and time in its vicinity.

