From mboxrd@z Thu Jan  1 00:00:00 1970
Path: main.gmane.org!not-for-mail
From: Kenichi Handa <handa@m17n.org>
Newsgroups: gmane.emacs.devel
Subject: Re: utf-8 cjk translation bug?
Date: Thu, 2 Oct 2003 10:08:15 +0900 (JST)
Sender: emacs-devel-bounces+emacs-devel=quimby.gnus.org@gnu.org
Message-ID: <200310020108.KAA03803@etlken.m17n.org>
References: <buo7k3q3cgu.fsf@mcspd15.ucom.lsi.nec.co.jp>	<200309301259.VAA01304@etlken.m17n.org>
	<rzq65j9nn4j.fsf@albion.dl.ac.uk>
NNTP-Posting-Host: deer.gmane.org
Mime-Version: 1.0 (generated by SEMI 1.14.3 - "Ushinoya")
Content-Type: text/plain; charset=ISO-2022-JP
X-Trace: sea.gmane.org 1065057809 11707 80.91.224.253 (2 Oct 2003 01:23:29 GMT)
X-Complaints-To: usenet@sea.gmane.org
NNTP-Posting-Date: Thu, 2 Oct 2003 01:23:29 +0000 (UTC)
Cc: emacs-devel@gnu.org, miles@gnu.org
Original-To: d.love@dl.ac.uk
In-reply-to: <rzq65j9nn4j.fsf@albion.dl.ac.uk> (message from Dave Love on Wed, 
	01 Oct 2003 13:44:28 +0100)
User-Agent: SEMI/1.14.3 (Ushinoya) FLIM/1.14.2 (Yagi-Nishiguchi) APEL/10.2
	Emacs/21.2.92 (sparc-sun-solaris2.6) MULE/5.0 (SAKAKI)
List-Id: Emacs development discussions.  <emacs-devel.gnu.org>

In article <rzq65j9nn4j.fsf@albion.dl.ac.uk>, Dave Love <d.love@dl.ac.uk> writes:

>>  I think the reason why they are excluded from the
>>  translation is that they are representable by the charset
>>  mule-unicode-e000-ffff, thus there's no need of translation.

> That was part of the reason for it -- the hash-based translation code
> is only relevant because we more-or-less used up the code space for
> the BMP.  I also chose the boundaries to avoid breaking the region
> between the mule-unicode and CJK charsets.

Sorry, I don't understand the meaning of the last sentence.

>>  It seems to be a reasonable decision, but considering that
>>  most users don't have an ISO10646-1 font containing those
>>  glyphs,

> I thought they typically did if they had 10646 fonts at all.  Is the
> problem that in recent XFree86, for instance, the double-width
> characters are in different fonts which have `adstyl' `ja' or `ko'?

Ah, right, they have double-width glyphs for those chars.
But I think there are still many users who are not running a
recent XFree86, or who have not installed those fonts.

> As far as I remember, the fontset code doesn't deal with that yet.
> (So many special cases, sigh.)

Right.  So even for XFree86 users, we need extra work in the
fontset code before those fonts can be used.

>>  and that those characters can also be regarded as
>>  CJK components (only CJK users uses them), I think we had
>>  better not exclude them from the translation.

> I'm not really convinced, but I don't feel strongly about it.  (If the
> extra charsets hadn't been added before mule-unicode, we'd just have
> covered the BMP with more mule-unicode ones.)

And if I had known it would take this long to release the code
that contains the mule-unicode charsets, I would have implemented
a single 3-dimensional charset that covers almost all Unicode
characters (Charset-ID 159 is not yet used).

>>  So, I suggest changing the above line (and similar lines in
>>  the other subst-XXX.el) to:
>> 
>>       (if (>= unicode #x2e80)
>>  	 (puthash unicode  char ucs-unicode-to-mule-cjk))
>> 
>>  and modify ccl-decode-mule-utf-8 to check translation also
>>  for those characters.
>> 
>>  Dave, what do you think?  Does such a change leads to any
>>  problem?

> As far as I remember, it includes too much, and you end up displaying
> some characters double width that probably shouldn't be, but I don't
> remember which.  How about including the ranges of the double-width
> Western characters and the high CJK stuff explicitly?  I guess it
> doesn't expand the tables greatly.

OK, I've just installed code that includes U+FF00..U+FFEF
in the decode tables.

Now, in utf-translate-cjk mode:

(decode-coding-string
 (encode-coding-string "ＮＥＣエレクトロニクス(株)" 'utf-8)
 'utf-8)
=> "ＮＥＣエレクトロニクス(株)"

---
Ken'ichi HANDA
handa@m17n.org