From: Kenichi Handa
Newsgroups: gmane.emacs.devel
Subject: Re: What exactly is chinese-big5?
Date: Fri, 18 Apr 2008 10:32:15 +0900
To: Eli Zaretskii
Cc: emacs-devel@gnu.org
In-reply-to: (message from Eli Zaretskii on Thu, 17 Apr 2008 07:15:04 -0400)

Eli Zaretskii writes:

> Emacs 22.2 supports the chinese-big5 encoding.  However, I cannot find
> anywhere the precise description of which flavor(s) of BIG5 is/are
> supported.  The Wiki page http://en.wikipedia.org/wiki/Big5 describes
> half a dozen of extensions to the original Big5 encoding, so it would
> be good to know which one(s) we support.

Emacs supports the full range of the Big5 code space, i.e.

    1st byte: 0xA1 .. 0xFE
    2nd byte: 0x40 .. 0x7E and 0xA1 .. 0xFE

This means that Emacs can decode every Big5 code point that fits in
the above range.  Emacs 22 pays no attention to which code point is
assigned to which Chinese character.  It has a separate character
space for Big5 characters (in the charsets chinese-big5-1 and
chinese-big5-2) and thus can hold all possible characters.  Some code
points may be assigned to no character in some variants of Big5.
Emacs 22.2 simply doesn't care about that.

Emacs 23 supports all code points of Big5 as well.  It first decodes
Big5 characters into a single separate code space (#x130000 and
over), then unifies most of them with Unicode by using a charset map
distributed with glibc (/usr/share/i18n/charmaps/BIG5.gz).  So, for
instance, Big5 A140 is decoded and unified to U+3000, but Big5 FEFE is
just decoded to #x134621 (outside the Unicode range).

> The specific situation where I needed to know this was when I was
> handed a file with what was supposed to be Chinese text and was asked
> to convert it to UTF-8.  detect-coding-region suggested chinese-big5
> as the only Chinese encoding for the non-ASCII characters in the file,
> so I tried that.  Interestingly enough, both `recode' and `iconv'
> refused to convert the file, no matter what flavor of Big5 (including
> cp950) I tried, but Emacs read the file with no problems and produced
> what seems like a valid UTF-8 encoding.  `iconv' 1.12 supports quite a
> few Big5 flavors, but they all choked on some characters in the file.

When you read a Big5 file containing the byte sequence FEFE and try
to write it as utf-8, Emacs 22.2 silently generates U+FFFD
(REPLACEMENT CHARACTER), as described in the docstring of the utf-8
coding system.  So it is possible that the file you wrote also
contains that character.  On the other hand, Emacs 23 warns that only
Big5 and utf-8-emacs can encode it.

> So what exactly is chinese-big5 in Emacs, and how come it succeeds
> where the latest `iconv' fails?

As explained above.

> In particular, should I worry about
> possibly incorrect conversion by Emacs, where `iconv' barfs (the file
> is very large and I cannot proofread all the converted strings)?

In Emacs 22, you can read the written file as utf-8 and search for
U+FFFD.  In Emacs 23, you'll see the warning when writing the file.

---
Kenichi Handa
handa@ni.aist.go.jp
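
For readers who want to run the two checks described above outside
Emacs, here is a minimal Python sketch.  It is an illustration, not
anything taken from Emacs or iconv: the function names
in_big5_code_space and find_replacement_chars are hypothetical.  The
first tests a byte pair against the two-byte range quoted in the
message, and the second lists the offsets of U+FFFD in text that has
already been decoded from the converted file.

```python
from typing import List

# Sketch only: these helper names are hypothetical, not part of Emacs or iconv.

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def in_big5_code_space(lead: int, trail: int) -> bool:
    """True if (lead, trail) lies in the two-byte range quoted in the message."""
    lead_ok = 0xA1 <= lead <= 0xFE
    trail_ok = (0x40 <= trail <= 0x7E) or (0xA1 <= trail <= 0xFE)
    return lead_ok and trail_ok


def find_replacement_chars(text: str) -> List[int]:
    """Return the character offsets of every U+FFFD in already-decoded text."""
    return [i for i, ch in enumerate(text) if ch == REPLACEMENT]


if __name__ == "__main__":
    print(in_big5_code_space(0xA1, 0x40))  # Big5 A140, inside the range
    print(in_big5_code_space(0xFE, 0xFE))  # Big5 FEFE, inside the range
    print(in_big5_code_space(0xA1, 0x7F))  # trail byte in the 0x7F..0xA0 gap
    print(find_replacement_chars("ab\ufffdcd\ufffd"))
```

To check a converted file, one could read it with
open(path, encoding="utf-8") and pass the resulting string to
find_replacement_chars; an empty list means no U+FFFD was written.
Note that the range test only says a byte pair is in the code space,
not that any particular Big5 variant assigns a character to it.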