From mboxrd@z Thu Jan  1 00:00:00 1970
Path: main.gmane.org!not-for-mail
From: Dave Love <d.love@dl.ac.uk>
Newsgroups: gmane.emacs.devel
Subject: Re: utf-8 cjk translation bug?
Date: Wed, 01 Oct 2003 13:44:28 +0100
Sender: emacs-devel-bounces+emacs-devel=quimby.gnus.org@gnu.org
Message-ID: <rzq65j9nn4j.fsf@albion.dl.ac.uk>
References: <buo7k3q3cgu.fsf@mcspd15.ucom.lsi.nec.co.jp>
	<200309301259.VAA01304@etlken.m17n.org>
NNTP-Posting-Host: deer.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: sea.gmane.org 1065063299 18343 80.91.224.253 (2 Oct 2003 02:54:59 GMT)
X-Complaints-To: usenet@sea.gmane.org
NNTP-Posting-Date: Thu, 2 Oct 2003 02:54:59 +0000 (UTC)
Cc: emacs-devel@gnu.org, miles@gnu.org
Original-X-From: emacs-devel-bounces+emacs-devel=quimby.gnus.org@gnu.org Thu Oct 02 04:54:57 2003
Return-path: <emacs-devel-bounces+emacs-devel=quimby.gnus.org@gnu.org>
Original-Received: from quimby.gnus.org ([80.91.224.244])
	by deer.gmane.org with esmtp (Exim 3.35 #1 (Debian))
	id 1A4tcG-0002cc-00
	for <emacs-devel@deer.gmane.org>; Thu, 02 Oct 2003 04:54:56 +0200
Original-Received: from monty-python.gnu.org ([199.232.76.173])
	by quimby.gnus.org with esmtp (Exim 3.35 #1 (Debian))
	id 1A4tcG-0004ih-00
	for <emacs-devel@quimby.gnus.org>; Thu, 02 Oct 2003 04:54:56 +0200
Original-Received: from localhost ([127.0.0.1] helo=monty-python.gnu.org)
	by monty-python.gnu.org with esmtp (Exim 4.24)
	id 1A4taw-00007Y-Qg
	for emacs-devel@quimby.gnus.org; Wed, 01 Oct 2003 22:53:34 -0400
Original-Received: from list by monty-python.gnu.org with tmda-scanned (Exim 4.24)
	id 1A4tF6-0003Dr-4h
	for emacs-devel@gnu.org; Wed, 01 Oct 2003 22:31:00 -0400
Original-Received: from mail by monty-python.gnu.org with spam-scanned (Exim 4.24)
	id 1A4sxN-00082P-6D
	for emacs-devel@gnu.org; Wed, 01 Oct 2003 22:13:12 -0400
Original-Received: from [148.79.80.39] (helo=albion.dl.ac.uk)
	by monty-python.gnu.org with esmtp (Exim 4.24)
	id 1A4gLJ-0003ZM-Me; Wed, 01 Oct 2003 08:44:34 -0400
Original-Received: from fx by albion.dl.ac.uk with local (Exim 3.35 #1 (Debian))
	id 1A4gLE-0003z6-00; Wed, 01 Oct 2003 13:44:28 +0100
Original-To: Kenichi Handa <handa@m17n.org>
User-Agent: Gnus/5.1003 (Gnus v5.10.3) Emacs/21.2 (gnu/linux)
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.2
Precedence: list
List-Id: Emacs development discussions.  <emacs-devel.gnu.org>
List-Unsubscribe: <http://mail.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://mail.gnu.org/pipermail/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <http://mail.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+emacs-devel=quimby.gnus.org@gnu.org
Xref: main.gmane.org gmane.emacs.devel:16849
X-Report-Spam: http://spam.gmane.org/gmane.emacs.devel:16849

Kenichi Handa <handa@m17n.org> writes:

> So, #xFF?? are excluded from ucs-unicode-to-mule-cjk, thus
> they are not translated to japanese-jisx0208 on decoding.
> If you have an ISO10646-1 font that contains full width
> glyphs for those characters, you can see correct glyphs.

Or you can display them with a jisx font, for instance.

> I think the reason why they are excluded from the
> translation is that they are representable by the charset
> mule-unicode-e000-ffff, thus there's no need of translation.

That was part of the reason for it -- the hash-based translation code
is only relevant because we more-or-less used up the code space for
the BMP.  I also chose the boundaries to avoid breaking the region
between the mule-unicode and CJK charsets.

> It seems to be a reasonable decision, but considering that
> most users don't have an ISO10646-1 font containing those
> glyphs,

I thought they typically did if they had 10646 fonts at all.  Is the
problem that in recent XFree86, for instance, the double-width
characters are in different fonts which have `adstyl' `ja' or `ko'?
As far as I remember, the fontset code doesn't deal with that yet.
(So many special cases, sigh.)

> and that those characters can also be regarded as
> CJK components (only CJK users use them), I think we had
> better not exclude them from the translation.

I'm not really convinced, but I don't feel strongly about it.  (If the
extra charsets hadn't been added before mule-unicode, we'd just have
covered the BMP with more mule-unicode ones.)

> So, I suggest changing the above line (and similar lines in
> the other subst-XXX.el) to:
>
>      (if (>= unicode #x2e80)
>          (puthash unicode char ucs-unicode-to-mule-cjk))
>
> and modify ccl-decode-mule-utf-8 to check translation also
> for those characters.
>
> Dave, what do you think?  Does such a change lead to any
> problems?

As far as I remember, it includes too much, and you end up displaying
some characters double width that probably shouldn't be, but I don't
remember which.  How about including the ranges of the double-width
Western characters and the high CJK stuff explicitly?  I guess it
doesn't expand the tables greatly.
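
To make the explicit-ranges idea concrete, here is a rough sketch of
how the test might look.  The names `my-cjk-translate-ranges',
`my-cjk-translatable-p' and `my-cjk-maybe-puthash' are made up for
illustration and are not in Emacs, and the ranges are only a plausible
guess based on Unicode block boundaries, not a checked list; which
characters end up double width would still need verifying.

```elisp
;; Sketch only: all `my-cjk-' names are hypothetical, not part of Emacs.
;; The ranges below are an assumed, illustrative selection.
(defvar my-cjk-translate-ranges
  '((#x2e80 . #x2eff)   ; CJK Radicals Supplement
    (#x3000 . #x30ff)   ; CJK Symbols and Punctuation, Hiragana, Katakana
    (#x4e00 . #x9fff)   ; CJK Unified Ideographs
    (#xf900 . #xfaff)   ; CJK Compatibility Ideographs
    (#xff01 . #xff60)   ; Fullwidth ASCII variants
    (#xffe0 . #xffe6))  ; Fullwidth symbol variants
  "Illustrative inclusive (FROM . TO) code point ranges to translate.")

(defun my-cjk-translatable-p (unicode)
  "Return non-nil if UNICODE falls in one of `my-cjk-translate-ranges'."
  (let ((ranges my-cjk-translate-ranges)
        found)
    ;; Walk the list until a range contains UNICODE or the list runs out.
    (while (and ranges (not found))
      (when (and (<= (car (car ranges)) unicode)
                 (<= unicode (cdr (car ranges))))
        (setq found t))
      (setq ranges (cdr ranges)))
    found))

(defun my-cjk-maybe-puthash (unicode char table)
  "Record UNICODE -> CHAR in TABLE only when UNICODE is in a listed range."
  (if (my-cjk-translatable-p unicode)
      (puthash unicode char table)))
```

The single `(>= unicode #x2e80)' test in the subst-XXX.el files would
then become a call along the lines of
`(my-cjk-maybe-puthash unicode char ucs-unicode-to-mule-cjk)', so
anything outside the listed ranges (the halfwidth forms, say) stays
untranslated.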