From mboxrd@z Thu Jan 1 00:00:00 1970
From: Kenichi Handa
Newsgroups: gmane.emacs.bugs
Subject: bug#12291: [rev 109796] wrong UTF-8 handling
Date: Tue, 28 Aug 2012 23:57:39 +0900
Message-ID: <87a9xfdpy4.fsf@gnu.org>
References: <20120828.074720.480105751.wl@gnu.org>
To: Werner LEMBERG
Cc: 12291@debbugs.gnu.org, smithcu@gvsu.edu
In-Reply-To: <20120828.074720.480105751.wl@gnu.org> (message from Werner LEMBERG on Tue, 28 Aug 2012 07:47:20 +0200 (CEST))

In article <20120828.074720.480105751.wl@gnu.org>, Werner LEMBERG writes:

> Have a look at the attached file, containing a single character.
> (It's transmitted as binary to avoid e-mail encoding issues).  It
> contains a single, four-byte UTF-8 encoded character (0xF4 0xB5 0x87
> 0x9E, which would map to the non-existent Unicode character code
> U+1351DE).  If I load this file as UTF-8 encoded, Emacs gives this as
> the output of `C-u C-x =':

>   position: 1 of 2 (0%), column: 0
>   character: 二 (displayed as 二) (codepoint 20108, #o47214, #x4e8c)
[...]

> Look what Emacs says about the file code.  If I save this
> one-character file as UTF-8, the character code stays as-is.

> This behaviour is clearly wrong.

Sure.

> I suspect that Emacs is using such a
> high character code for internal representation of the `emacs-mule'
> encoding.  However, the user must not see this.

That higher character code area is used for two purposes.

One is for reading CJK characters of legacy encodings (euc, sjis,
big5, etc).  They are decoded into the utf-8-emacs byte sequences
corresponding to the higher character code area.  On getting their
character codes, most of them are unified into Unicode BMP
characters, but a few are left un-unified.  Those are the private
characters in each legacy character set.

The other is for supporting non-Unicode characters.  The biggest set
is GB18030.

In both cases, users surely see them.

> Instead, such characters must be converted to correct
> UTF-8.

???  I don't understand what you mean by "correct UTF-8".  I think
the correct behaviour on reading such a file as utf-8 is to treat
each byte as a raw byte.

---
Kenichi Handa
handa@gnu.org
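
For illustration only, here is a minimal Python sketch (not Emacs code) of
the arithmetic in the report and of the per-byte fallback suggested above.
The function names are hypothetical. Python's `surrogateescape` error
handler is used merely as an analogy for keeping undecodable bytes as
individual raw bytes; it is an assumption of this sketch and not how Emacs
represents raw bytes internally.

```python
UNICODE_MAX = 0x10FFFF  # highest code point defined by Unicode


def decode_four_byte_sequence(data: bytes) -> int:
    """Apply the four-byte UTF-8 bit layout without any range check."""
    if len(data) != 4 or data[0] & 0xF8 != 0xF0:
        raise ValueError("expected four bytes with a 11110xxx lead byte")
    if any(b & 0xC0 != 0x80 for b in data[1:]):
        raise ValueError("trailing bytes must have the form 10xxxxxx")
    code = data[0] & 0x07            # 3 payload bits from the lead byte
    for b in data[1:]:
        code = (code << 6) | (b & 0x3F)  # 6 payload bits per trailing byte
    return code


def decode_with_raw_byte_fallback(data: bytes) -> str:
    """Decode as UTF-8, keeping each undecodable byte as its own unit.

    With errors="surrogateescape", an invalid byte 0xNN becomes the lone
    surrogate U+DCNN, so the original bytes can be recovered on encoding.
    """
    return data.decode("utf-8", errors="surrogateescape")


SAMPLE = bytes([0xF4, 0xB5, 0x87, 0x9E])  # the bytes from the report

if __name__ == "__main__":
    code = decode_four_byte_sequence(SAMPLE)
    print(f"bit layout gives U+{code:X}; beyond Unicode: {code > UNICODE_MAX}")
    text = decode_with_raw_byte_fallback(SAMPLE)
    print([f"U+{ord(ch):04X}" for ch in text])
    # Saving must give back exactly the bytes that were read.
    assert text.encode("utf-8", errors="surrogateescape") == SAMPLE
```

The point of the second function is that a strict UTF-8 decoder rejects
the sequence, so each of the four bytes is carried through individually
and written back unchanged.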