From mboxrd@z Thu Jan  1 00:00:00 1970
Path: main.gmane.org!not-for-mail
From: Kenichi Handa <handa@m17n.org>
Newsgroups: gmane.emacs.devel
Subject: Re: utf-8 cjk translation bug?
Date: Tue, 30 Sep 2003 21:59:42 +0900 (JST)
Sender: emacs-devel-bounces+emacs-devel=quimby.gnus.org@gnu.org
Message-ID: <200309301259.VAA01304@etlken.m17n.org>
References: <buo7k3q3cgu.fsf@mcspd15.ucom.lsi.nec.co.jp>
NNTP-Posting-Host: deer.gmane.org
Mime-Version: 1.0 (generated by SEMI 1.14.3 - "Ushinoya")
Content-Type: text/plain; charset=ISO-2022-JP
X-Trace: sea.gmane.org 1064927852 12952 80.91.224.253 (30 Sep 2003 13:17:32 GMT)
X-Complaints-To: usenet@sea.gmane.org
NNTP-Posting-Date: Tue, 30 Sep 2003 13:17:32 +0000 (UTC)
Cc: d.love@dl.ac.uk, emacs-devel@gnu.org
Original-X-From: emacs-devel-bounces+emacs-devel=quimby.gnus.org@gnu.org Tue Sep 30 15:17:29 2003
Return-path: <emacs-devel-bounces+emacs-devel=quimby.gnus.org@gnu.org>
Original-Received: from quimby.gnus.org ([80.91.224.244])
	by deer.gmane.org with esmtp (Exim 3.35 #1 (Debian))
	id 1A4KNd-0002Xd-00
	for <emacs-devel@deer.gmane.org>; Tue, 30 Sep 2003 15:17:29 +0200
Original-Received: from monty-python.gnu.org ([199.232.76.173])
	by quimby.gnus.org with esmtp (Exim 3.35 #1 (Debian))
	id 1A4KNd-000650-00
	for <emacs-devel@quimby.gnus.org>; Tue, 30 Sep 2003 15:17:29 +0200
Original-Received: from localhost ([127.0.0.1] helo=monty-python.gnu.org)
	by monty-python.gnu.org with esmtp (Exim 4.22)
	id 1A4KGy-0006Yw-4e
	for emacs-devel@quimby.gnus.org; Tue, 30 Sep 2003 09:10:36 -0400
Original-Received: from list by monty-python.gnu.org with tmda-scanned (Exim 4.22)
	id 1A4KGJ-0006YQ-Ka
	for emacs-devel@gnu.org; Tue, 30 Sep 2003 09:09:55 -0400
Original-Received: from mail by monty-python.gnu.org with spam-scanned (Exim 4.22)
	id 1A4KFn-0006SF-IR
	for emacs-devel@gnu.org; Tue, 30 Sep 2003 09:09:54 -0400
Original-Received: from [192.47.44.130] (helo=tsukuba.m17n.org)
	by monty-python.gnu.org with esmtp (Exim 4.22)
	id 1A4K6T-0005EO-Fk; Tue, 30 Sep 2003 08:59:45 -0400
Original-Received: from fs.m17n.org (fs.m17n.org [192.47.44.2])
	by tsukuba.m17n.org (8.11.6p2/3.7W-20010518204228) with ESMTP id
	h8UCxh324703; Tue, 30 Sep 2003 21:59:43 +0900 (JST)
	(envelope-from handa@m17n.org)
Original-Received: from etlken.m17n.org (etlken.m17n.org [192.47.44.125])
	by fs.m17n.org (8.11.6/3.7W-20010823150639) with ESMTP id h8UCxg916742; 
	Tue, 30 Sep 2003 21:59:43 +0900 (JST)
Original-Received: (from handa@localhost)
	by etlken.m17n.org (8.8.8+Sun/3.7W-2001040620) id VAA01304;
	Tue, 30 Sep 2003 21:59:42 +0900 (JST)
Original-To: miles@gnu.org
In-reply-to: <buo7k3q3cgu.fsf@mcspd15.ucom.lsi.nec.co.jp> (message from Miles
	Bader on 30 Sep 2003 17:30:25 +0900)
User-Agent: SEMI/1.14.3 (Ushinoya) FLIM/1.14.2 (Yagi-Nishiguchi) APEL/10.2
	Emacs/21.2.92 (sparc-sun-solaris2.6) MULE/5.0 (SAKAKI)
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.2
Precedence: list
List-Id: Emacs development discussions.  <emacs-devel.gnu.org>
List-Unsubscribe: <http://mail.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://mail.gnu.org/pipermail/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <http://mail.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+emacs-devel=quimby.gnus.org@gnu.org
Xref: main.gmane.org gmane.emacs.devel:16793
X-Report-Spam: http://spam.gmane.org/gmane.emacs.devel:16793

In article <buo7k3q3cgu.fsf@mcspd15.ucom.lsi.nec.co.jp>, Miles Bader <miles@lsi.nec.co.jp> writes:
> I have `utf-translate-cjk-mode' enabled.
> I have the following string in a buffer:

>         ＮＥＣエレクトロニクス(株)

> If I write it using say `euc-jp' coding system, no problem.  According
> to `C-u C-x =', all the japanese characters are in the charset
> japanese-jisx0208.

> However, if I save it using utf-8, I get no complaints, but when I read
> it back in, the first 3 characters show up as little boxes.  `C-u C-x ='
> shows the boxes as being in charset mule-unicode-e000-ffff; the rest of
> the characters are still listed as being in japanese-jisx0208.

> I presume this is representable utf-8, because unicode is supposed to be
> able to represent all characters in any component character set
> simultaneously, so it would seem to be a bug in utf-translate-cjk-mode.

The first three letters are "FULLWIDTH LATIN CAPITAL LETTER ??"
(U+FF??).  Yes, they are representable in utf-8.  But, in
subst-jis.el, we have this code:

(mapc
 (lambda (pair)
   (let ((unicode (car pair))
	 (char (cadr pair)))
     ;; exclude non-CJK components from decode table
     (if (and (>= unicode #x2e80) (<= unicode #xd7a3))
	 (puthash unicode  char ucs-unicode-to-mule-cjk))
     (puthash char unicode ucs-mule-cjk-to-unicode)))

So, #xFF?? are excluded from ucs-unicode-to-mule-cjk, thus
they are not translated to japanese-jisx0208 on decoding.
If you have an ISO10646-1 font that contains full width
glyphs for those characters, you can see correct glyphs.
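
To make the range test in the excerpt above concrete, here is a
minimal sketch that applies the same comparison to the first three
code points of Miles's string (U+FF2E, U+FF25, U+FF23) and to one
katakana character (U+30A8).  The function name
my-in-cjk-decode-range-p is made up for this illustration and is
not part of subst-jis.el.

```elisp
;; Hypothetical helper, not part of subst-jis.el: the same range test
;; that guards the puthash into ucs-unicode-to-mule-cjk.
(defun my-in-cjk-decode-range-p (unicode)
  "Return non-nil if UNICODE passes the current decode-table range test."
  (and (>= unicode #x2e80) (<= unicode #xd7a3)))

;; Fullwidth N, E, C all lie above #xd7a3, so they are skipped.
(mapcar #'my-in-cjk-decode-range-p '(#xff2e #xff25 #xff23))
;; => (nil nil nil)

;; KATAKANA LETTER E lies inside the range, so it gets a decode entry.
(my-in-cjk-decode-range-p #x30a8)
;; => t
```

This matches what `C-u C-x =' reports: the katakana come back as
japanese-jisx0208, while the fullwidth letters stay in
mule-unicode-e000-ffff.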

I think the reason why they are excluded from the
translation is that they are representable by the charset
mule-unicode-e000-ffff, thus there's no need for translation.
It seems to be a reasonable decision, but considering that
most users don't have an ISO10646-1 font containing those
glyphs, and that those characters can also be regarded as
CJK components (only CJK users use them), I think we had
better not exclude them from the translation.

So, I suggest changing the above line (and similar lines in
the other subst-XXX.el) to:

     (if (>= unicode #x2e80)
	 (puthash unicode  char ucs-unicode-to-mule-cjk))

and modifying ccl-decode-mule-utf-8 so that it also checks
the translation for those characters.
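
As a rough sketch of the effect of dropping the upper bound, the
relaxed test can be tried on a few code points.  The function name
my-in-relaxed-decode-range-p is hypothetical and used only for
this illustration.

```elisp
;; Hypothetical helper illustrating the proposed test: only the
;; lower bound #x2e80 remains.
(defun my-in-relaxed-decode-range-p (unicode)
  "Return non-nil if UNICODE passes the proposed decode-table range test."
  (>= unicode #x2e80))

;; FULLWIDTH LATIN CAPITAL LETTER N now passes.
(my-in-relaxed-decode-range-p #xff2e)
;; => t

;; Code points below #x2e80 (here U+00E9) are still excluded.
(my-in-relaxed-decode-range-p #x00e9)
;; => nil
```

Since the puthash runs only for the pairs listed in each
subst-XXX.el table, the relaxed test adds decode entries only for
code points above #xd7a3 that those tables actually contain.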

Dave, what do you think?  Would such a change lead to any
problems?  Is there anything else we should change?

---
Ken'ichi HANDA
handa@m17n.org