From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Kenichi Handa <handa@m17n.org>
Newsgroups: gmane.emacs.devel
Subject: Re: What exactly is chinese-big5?
Date: Fri, 18 Apr 2008 10:32:15 +0900
Message-ID: <E1JmfSZ-0004LZ-U3@etlken.m17n.org>
References: <E1JmS52-0000vS-J3@fencepost.gnu.org>
NNTP-Posting-Host: lo.gmane.org
Mime-Version: 1.0 (generated by SEMI 1.14.3 - "Ushinoya")
Content-Type: text/plain; charset=US-ASCII
X-Trace: ger.gmane.org 1208482402 9217 80.91.229.12 (18 Apr 2008 01:33:22 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Fri, 18 Apr 2008 01:33:22 +0000 (UTC)
Cc: emacs-devel@gnu.org
To: Eli Zaretskii <eliz@gnu.org>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Fri Apr 18 03:33:46 2008
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([199.232.76.165])
	by lo.gmane.org with esmtp (Exim 4.50)
	id 1JmfTw-00021P-Rq
	for ged-emacs-devel@m.gmane.org; Fri, 18 Apr 2008 03:33:41 +0200
Original-Received: from localhost ([127.0.0.1] helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1JmfTI-00021l-0j
	for ged-emacs-devel@m.gmane.org; Thu, 17 Apr 2008 21:33:00 -0400
Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1JmfTC-0001zD-VC
	for emacs-devel@gnu.org; Thu, 17 Apr 2008 21:32:54 -0400
Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1JmfTA-0001xF-H4
	for emacs-devel@gnu.org; Thu, 17 Apr 2008 21:32:53 -0400
Original-Received: from [199.232.76.173] (helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1JmfTA-0001x5-9m
	for emacs-devel@gnu.org; Thu, 17 Apr 2008 21:32:52 -0400
Original-Received: from mx1.aist.go.jp ([150.29.246.133])
	by monty-python.gnu.org with esmtp (Exim 4.60)
	(envelope-from <handa@m17n.org>)
	id 1JmfT5-0003kC-9V; Thu, 17 Apr 2008 21:32:47 -0400
Original-Received: from rqsmtp1.aist.go.jp (rqsmtp1.aist.go.jp [150.29.254.115])
	by mx1.aist.go.jp  with ESMTP id m3I1WScc001456;
	Fri, 18 Apr 2008 10:32:30 +0900 (JST) env-from (handa@m17n.org)
Original-Received: from smtp3.aist.go.jp
	by rqsmtp1.aist.go.jp  with ESMTP id m3I1WRaG016203;
	Fri, 18 Apr 2008 10:32:27 +0900 (JST) env-from (handa@m17n.org)
Original-Received: by smtp3.aist.go.jp  with ESMTP id m3I1WGAk020181;
	Fri, 18 Apr 2008 10:32:16 +0900 (JST) env-from (handa@m17n.org)
Original-Received: from handa by etlken.m17n.org with local (Exim 4.69)
	(envelope-from <handa@m17n.org>)
	id 1JmfSZ-0004LZ-U3; Fri, 18 Apr 2008 10:32:15 +0900
In-reply-to: <E1JmS52-0000vS-J3@fencepost.gnu.org> (message from Eli Zaretskii
	on Thu, 17 Apr 2008 07:15:04 -0400)
User-Agent: SEMI/1.14.3 (Ushinoya) FLIM/1.14.2 (Yagi-Nishiguchi) APEL/10.2
	Emacs/23.0.60 (i686-pc-linux-gnu) MULE/6.0 (HANACHIRUSATO)
X-detected-kernel: by monty-python.gnu.org: Solaris 8 (1)
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:95422
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/95422>

In article <E1JmS52-0000vS-J3@fencepost.gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

> Emacs 22.2 supports the chinese-big5 encoding.  However, I cannot find
> anywhere the precise description of which flavor(s) of BIG5 is/are
> supported.  The Wiki page http://en.wikipedia.org/wiki/Big5 describes
> half a dozen of extensions to the original Big5 encoding, so it would
> be good to know which one(s) we support.

Emacs supports the full range of the Big5 code space, i.e.
   1st byte: 0xA1 .. 0xFE
   2nd byte: 0x40 .. 0x7E and 0xA1 .. 0xFE

This means that Emacs can decode every Big5 code point that
falls in the above range.  Emacs 22 pays no attention to
which code point is assigned to which Chinese character.
It has a separate character space for Big5 characters (in the
charsets chinese-big5-1 and chinese-big5-2) and thus can
hold all possible characters.  Some code points may be
assigned to no character in some variants of Big5; Emacs
22.2 simply doesn't care about that.
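
As a rough illustration of the byte ranges above, here is a minimal
Python sketch that checks whether a two-byte pair falls inside that
code space.  The function name is made up for this example; it only
restates the ranges and is not how Emacs itself is implemented.

```python
def in_big5_code_space(lead: int, trail: int) -> bool:
    """Return True if (lead, trail) lies in the Big5 range quoted above.

    Hypothetical helper for illustration only.
    """
    # 1st byte: 0xA1 .. 0xFE
    lead_ok = 0xA1 <= lead <= 0xFE
    # 2nd byte: 0x40 .. 0x7E and 0xA1 .. 0xFE
    trail_ok = (0x40 <= trail <= 0x7E) or (0xA1 <= trail <= 0xFE)
    return lead_ok and trail_ok
```

Note that this says nothing about whether a given variant of Big5
actually assigns a character to the code point.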

Emacs 23 supports all code points of Big5 as well.  It
first decodes Big5 characters into a single separate code space
(#x130000 and up), then unifies most of them with Unicode
by using a charset map distributed with glibc
(/usr/share/i18n/charmaps/BIG5.gz).  So, for instance, Big5
A140 is decoded and unified to U+3000, but Big5 FEFE is just
decoded to #x134621 (outside the Unicode range).
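
The arithmetic behind a value such as #x134621 can be sketched in
Python as follows.  The layout here is an assumption, not something
stated in this message: it treats the second byte as one contiguous
span 0x40 .. 0xFE (191 values per row), which happens to reproduce
#x134621 for FEFE.  The function name is hypothetical, and the
unification step via the glibc charmap is not modelled.

```python
BASE = 0x130000                        # start of the separate code space
LEAD_MIN = 0xA1                        # lowest 1st byte
TRAIL_MIN, TRAIL_MAX = 0x40, 0xFE      # ASSUMED contiguous 2nd-byte span
ROW_WIDTH = TRAIL_MAX - TRAIL_MIN + 1  # 191 slots per 1st-byte row

def big5_raw_code(lead: int, trail: int) -> int:
    """Pre-unification code for a Big5 byte pair (assumed layout)."""
    return BASE + (lead - LEAD_MIN) * ROW_WIDTH + (trail - TRAIL_MIN)

# FEFE lands above the last Unicode code point, U+10FFFF.
assert big5_raw_code(0xFE, 0xFE) == 0x134621
assert big5_raw_code(0xFE, 0xFE) > 0x10FFFF
```

Under the same assumption A140 would map to #x130000 before being
unified to U+3000.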

> The specific situation where I needed to know this was when I was
> handed a file with what was supposed to be Chinese text and was asked
> to convert it to UTF-8.  detect-coding-region suggested chinese-big5
> as the only Chinese encoding for the non-ASCII characters in the file,
> so I tried that.  Interestingly enough, both `recode' and `iconv'
> refused to convert the file, no matter what flavor of Big5 (including
> cp950) I tried, but Emacs read the file with no problems and produced
> what seems like a valid UTF-8 encoding.  `iconv' 1.12 supports quite a
> few Big5 flavors, but they all choked on some characters in the file.

When you read a Big5 file containing the byte sequence FEFE and
try to write it as utf-8, Emacs 22.2 silently generates U+FFFD
(REPLACEMENT CHARACTER), as described in the docstring of
the utf-8 coding system.  So it is possible that the
file you wrote also contains that character.

On the other hand, Emacs 23 warns that only Big5 and
utf-8-emacs can encode it.

> So what exactly is chinese-big5 in Emacs, and how come it succeeds
> where the latest `iconv' fails?

As explained above.

> In particular, should I worry about
> possibly incorrect conversion by Emacs, where `iconv' barfs (the file
> is very large and I cannot proofread all the converted strings)?

In Emacs 22, you can read the written file as utf-8 and
search for U+FFFD.  In Emacs 23, you'll see the warning when
writing the file.
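
For a file too large to proofread, the same search for U+FFFD can
also be done outside Emacs.  The following is a minimal Python
sketch; the function name and the file name in the usage comment
are placeholders chosen for this example.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

def find_replacement_chars(text: str) -> list[tuple[int, int]]:
    """Return (line, column) for every U+FFFD in text.

    Lines are 1-based, columns are 0-based.
    """
    hits = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line):
            if ch == REPLACEMENT:
                hits.append((lineno, col))
    return hits

# Usage (hypothetical file name):
#   with open("converted.txt", encoding="utf-8") as f:
#       for lineno, col in find_replacement_chars(f.read()):
#           print(lineno, col)
```

An empty result means no U+FFFD was written.  It does not prove
that every other character was mapped the way the file's author
intended.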

---
Kenichi Handa
handa@ni.aist.go.jp