From mboxrd@z Thu Jan 1 00:00:00 1970
From: Kenichi Handa
Newsgroups: gmane.emacs.devel
Subject: Re: What exactly is chinese-big5?
Date: Fri, 18 Apr 2008 20:28:08 +0900
To: Eli Zaretskii
Cc: emacs-devel@gnu.org
Mime-Version: 1.0 (generated by SEMI 1.14.3 - "Ushinoya")
Content-Type: text/plain; charset=US-ASCII

In article , Eli Zaretskii writes:

> > In Emacs 22, you can read the written file by utf-8 and
> > search for U+FFFD.

> Is U+FFFD the _only_ character that will be produced for any codepoint
> that is unassigned in the Big5 code space? That is, if I search for
> U+FFFD, will I find _all_ the places where the original file had
> something not belonging to Big5?

Not exactly. U+FFFD is the only character that will be produced
for "any character that can't be unified with Unicode". Which
Big5 characters can be unified with Unicode is defined in
subst-big5.el in Emacs 22 (I don't know which Big5 version Dave
used to make that file) and in etc/charsets/BIG5.map in Emacs 23.
So, if the dialect of Big5 differs from what is defined in those
files, it is possible that some character which the file's creator
considers to be Big5 is encoded as U+FFFD.
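
For readers who want to run the same search outside Emacs, here is a minimal sketch in Python 3. It assumes the text has already been written out and read back as UTF-8, as described above. The function name `find_replacement_chars` is hypothetical and is not part of Emacs or of any library.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def find_replacement_chars(text):
    """Return 1-based (line, column) pairs for every U+FFFD in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch == REPLACEMENT:
                hits.append((lineno, col))
    return hits


# Illustrative input only; a real check would read the converted file.
sample = "abc\ufffd\nok\n\ufffdx"
print(find_replacement_chars(sample))
```

As noted above, a hit only means the character could not be unified with Unicode under the mapping table in use, so an empty result does not prove that the file is free of dialect-specific characters.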

> Also, assuming that I find one or more invalid characters, is there
> some encoding other than chinese-big5 that I should try, which could
> explain those problematic characters, besides those I mentioned in my
> original message? This file came from Chinese speaking people, so
> there's little doubt it should include only strings that can be read
> by Chinese speakers. Therefore, I wonder how come it does not
> translate cleanly into Unicode. (I cannot ask the people who produced
> the file about these issues, since they seem to be pretty ignorant
> about that: they claimed the file was in UTF-8...)

That file may be GBK, whose code space is similar to but wider
than that of Big5. However, GBK is supported only in Emacs 23.

---
Kenichi Handa
handa@ni.aist.go.jp
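
One rough way to compare the Big5 and GBK hypotheses outside Emacs is to decode the raw bytes under each encoding and count how many U+FFFD characters result. The sketch below assumes Python 3 and its built-in `big5` and `gbk` codecs, whose mapping tables are not necessarily the same as subst-big5.el or etc/charsets/BIG5.map. The function name `count_undecodable` is hypothetical.

```python
def count_undecodable(data, encodings=("big5", "gbk")):
    """Map each encoding name to the number of U+FFFD produced when decoding."""
    counts = {}
    for name in encodings:
        # errors="replace" substitutes U+FFFD for bytes the codec rejects.
        decoded = data.decode(name, errors="replace")
        counts[name] = decoded.count("\ufffd")
    return counts


# Illustrative bytes only: U+4E02 encoded as GBK has a lead byte (0x81)
# that lies outside the Big5 lead-byte range.
data = "\u4e02".encode("gbk")
print(count_undecodable(data))
```

A zero count is only weak evidence: because the two code spaces overlap, many GBK byte sequences also decode without error as Big5 and simply yield the wrong characters, so the decoded text still needs to be read by someone who knows the language.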