From mboxrd@z Thu Jan 1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Werner LEMBERG
Newsgroups: gmane.emacs.bugs
Subject: bug#12291: [rev 109796] wrong UTF-8 handling
Date: Tue, 28 Aug 2012 07:47:20 +0200 (CEST)
Message-ID: <20120828.074720.480105751.wl@gnu.org>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: Multipart/Mixed; boundary="--Next_Part(Tue_Aug_28_07_47_20_2012_714)--"
Content-Transfer-Encoding: 7bit
X-Trace: ger.gmane.org 1346132893 2667 80.91.229.3 (28 Aug 2012 05:48:13 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Tue, 28 Aug 2012 05:48:13 +0000 (UTC)
Cc: Curtis Smith
To: 12291@debbugs.gnu.org
Original-X-From: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org Tue Aug 28 07:48:14 2012
Return-path:
Envelope-to: geb-bug-gnu-emacs@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1T6Eeq-0003dz-Rn for geb-bug-gnu-emacs@m.gmane.org; Tue, 28 Aug 2012 07:48:13 +0200
Original-Received: from localhost ([::1]:37172 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1T6Eeo-00034q-Sr for geb-bug-gnu-emacs@m.gmane.org; Tue, 28 Aug 2012 01:48:10 -0400
Original-Received: from eggs.gnu.org ([208.118.235.92]:35675) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1T6Eem-00034k-Ba for bug-gnu-emacs@gnu.org; Tue, 28 Aug 2012 01:48:09 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1T6Eek-0005gJ-RR for bug-gnu-emacs@gnu.org; Tue, 28 Aug 2012 01:48:08 -0400
Original-Received: from debbugs.gnu.org ([140.186.70.43]:43753) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1T6Eek-0005gE-O0 for bug-gnu-emacs@gnu.org; Tue, 28 Aug 2012 01:48:06 -0400
Original-Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.72) (envelope-from ) id 1T6Efe-0004Wg-Sa for bug-gnu-emacs@gnu.org; Tue, 28 Aug 2012 01:49:02 -0400
X-Loop: help-debbugs@gnu.org
Resent-From: Werner LEMBERG
Original-Sender: debbugs-submit-bounces@debbugs.gnu.org
Resent-CC: bug-gnu-emacs@gnu.org
Resent-Date: Tue, 28 Aug 2012 05:49:02 +0000
Resent-Message-ID:
Resent-Sender: help-debbugs@gnu.org
X-GNU-PR-Message: report 12291
X-GNU-PR-Package: emacs
X-GNU-PR-Keywords:
X-Debbugs-Original-To: bug-gnu-emacs@gnu.org
Original-Received: via spool by submit@debbugs.gnu.org id=B.134613291217358 (code B ref -1); Tue, 28 Aug 2012 05:49:02 +0000
Original-Received: (at submit) by debbugs.gnu.org; 28 Aug 2012 05:48:32 +0000
Original-Received: from localhost ([127.0.0.1]:53298 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T6Ef8-0004Vs-RC for submit@debbugs.gnu.org; Tue, 28 Aug 2012 01:48:31 -0400
Original-Received: from eggs.gnu.org ([208.118.235.92]:36974) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T6Ef5-0004Vk-90 for submit@debbugs.gnu.org; Tue, 28 Aug 2012 01:48:28 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1T6Ee9-0005eB-3a for submit@debbugs.gnu.org; Tue, 28 Aug 2012 01:47:30 -0400
Original-Received: from lists.gnu.org ([208.118.235.17]:46748) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1T6Ee9-0005e7-03 for submit@debbugs.gnu.org; Tue, 28 Aug 2012 01:47:29 -0400
Original-Received: from eggs.gnu.org ([208.118.235.92]:35595) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1T6Ee7-00034T-TV for bug-gnu-emacs@gnu.org; Tue, 28 Aug 2012 01:47:28 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1T6Ee6-0005ds-BJ for bug-gnu-emacs@gnu.org; Tue, 28 Aug 2012 01:47:27 -0400
Original-Received: from mailout-de.gmx.net ([213.165.64.22]:42756) by eggs.gnu.org with smtp (Exim 4.71) (envelope-from ) id 1T6Ee6-0005df-1y for bug-gnu-emacs@gnu.org; Tue, 28 Aug 2012 01:47:26 -0400
Original-Received: (qmail invoked by alias); 28 Aug 2012 05:47:23 -0000
Original-Received: from 178-190-192-56.adsl.highway.telekom.at (EHLO localhost) [178.190.192.56] by mail.gmx.net (mp002) with SMTP; 28 Aug 2012 07:47:23 +0200
X-Authenticated: #54312696
X-Provags-ID: V01U2FsdGVkX18SRP1Uhl4S2B8VkSc8PDoPiqQvi21Bu0HwbmTVf5 FSHgxwctFOMLy2
X-Mailer: Mew version 6.4rc1 on Emacs 24.2.50.1 / Mule 6.0 (HANACHIRUSATO)
X-Y-GMX-Trusted: 0
X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized.
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6 (newer, 3)
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.13
Precedence: list
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6 (newer, 2)
X-Received-From: 140.186.70.43
X-BeenThere: bug-gnu-emacs@gnu.org
List-Id: "Bug reports for GNU Emacs, the Swiss army knife of text editors"
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org
Original-Sender: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.bugs:63541
Archived-At:

----Next_Part(Tue_Aug_28_07_47_20_2012_714)--
Content-Type: Text/Plain; charset=iso-2022-jp
Content-Transfer-Encoding: 7bit

[bzr revision 109796]

Have a look at the attached file, containing a single character. (It's transmitted as binary to avoid e-mail encoding issues). It contains a single, four-byte UTF-8 encoded character (0xF4 0xB5 0x87 0x9E, which would map to the non-existent Unicode character code U+1351DE).

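The arithmetic behind the U+1351DE figure can be checked by hand. The following is only a sketch in Python, not anything Emacs does internally: `decode_utf8_lenient` is a hypothetical helper that applies the UTF-8 bit layout without the range checks a conforming decoder performs, so that the out-of-range value becomes visible. Python's own strict decoder is used at the end for comparison.

```python
def decode_utf8_lenient(data: bytes) -> int:
    """Decode the first UTF-8 sequence in `data` by bit arithmetic only.

    Hypothetical helper for illustration: it deliberately skips the
    checks for overlong forms, surrogates and values above U+10FFFF.
    """
    lead = data[0]
    if lead < 0x80:                      # single-byte (ASCII) form
        return lead
    if lead >> 5 == 0b110:               # 110xxxxx: two-byte form
        length, value = 2, lead & 0x1F
    elif lead >> 4 == 0b1110:            # 1110xxxx: three-byte form
        length, value = 3, lead & 0x0F
    elif lead >> 3 == 0b11110:           # 11110xxx: four-byte form
        length, value = 4, lead & 0x07
    else:
        raise ValueError("invalid lead byte")
    if len(data) < length:
        raise ValueError("truncated sequence")
    for byte in data[1:length]:
        if byte >> 6 != 0b10:            # continuation bytes are 10xxxxxx
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (byte & 0x3F)
    return value


SAMPLE = bytes([0xF4, 0xB5, 0x87, 0x9E])   # the four bytes from the report
code_point = decode_utf8_lenient(SAMPLE)
print(f"U+{code_point:X}")                 # the value quoted in the report
print(code_point > 0x10FFFF)               # beyond the last Unicode code point

# A strict decoder rejects the sequence instead of producing a character.
try:
    SAMPLE.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)
```

The lead byte 0xF4 contributes the bits 100, and the three continuation bytes contribute 110101, 000111 and 011110, which together give 0x1351DE. Since Unicode ends at U+10FFFF, a strict decoder has no character to return for this input.
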
If I load this file as UTF-8 encoded, Emacs gives this as the output of `C-u C-x =':

             position: 1 of 2 (0%), column: 0
            character: 二 (displayed as 二) (codepoint 20108, #o47214, #x4e8c)
    preferred charset: unicode (Unicode (ISO10646))
code point in charset: 0x4E8C
               syntax: w which means: word
             category: .:Base, C:2-byte han, L:Left-to-right (strong), c:Chinese, h:Korean, j:Japanese, |:line breakable
             to input: type "C-x 8 RET HEX-CODEPOINT" or "C-x 8 RET NAME"
          buffer code: #xE4 #xBA #x8C
            file code: #xE4 #xBA #x8C (encoded by coding system utf-8-unix)
              display: by this font (glyph code)
    xft:-unknown-SimSun-normal-normal-normal-*-24-*-*-*-d-0-iso10646-1 (#x460)

Character code properties: customize what to show
  name: CJK IDEOGRAPH-4E8C
  general-category: Lo (Letter, Other)
  decomposition: (20108) ('二')

Look what Emacs says about the file code. If I save this one-character file as UTF-8, the character code stays as-is.

This behaviour is clearly wrong. I suspect that Emacs is using such a high character code for internal representation of the `emacs-mule' encoding. However, the user must not see this. Instead, such characters must be converted to correct UTF-8.

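To make the mismatch concrete, here is a small Python sketch that sets the bytes of the attachment (the base64 payload at the end of this message) against the file code quoted in the `C-u C-x =' output above. The variable names are illustrative only; nothing here reproduces what Emacs does.

```python
import base64

# Base64 payload of the attached file, copied from the attachment part.
ATTACHMENT_B64 = "9LWHngo="
raw = base64.b64decode(ATTACHMENT_B64)

# "file code" as quoted in the `C-u C-x =' output of the report.
reported_file_code = bytes([0xE4, 0xBA, 0x8C])

print(raw.hex(" "))                      # bytes actually in the file
print(reported_file_code.hex(" "))       # bytes shown as the file code

# The reported bytes are valid UTF-8 for the character shown in the output.
reported_char = reported_file_code.decode("utf-8")
print(f"U+{ord(reported_char):04X}")

# The first four bytes of the file are the sequence from the report;
# the fifth is a trailing newline, matching "position: 1 of 2".
same_bytes = raw[:4] == reported_file_code
print(same_bytes)
```

The attachment decodes to F4 B5 87 9E followed by a newline, while the quoted file code E4 BA 8C is the ordinary UTF-8 encoding of U+4E8C, so the two byte sequences do not agree.
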
Werner

======================================================================

In GNU Emacs 24.2.50.1 (i686-pc-linux-gnu, GTK+ Version 2.24.9)
 of 2012-08-28 on linux-nvf0
Windowing system distributor `The X.Org Foundation', version 11.0.11004000
Configured using:
 `configure 'MAKEINFO=/usr/bin/makeinfo' '--with-x-toolkit=gtk''

Important settings:
  value of $LANG: de_DE.UTF-8
  value of $XMODIFIERS: @im=none
  locale-coding-system: utf-8-unix
  default enable-multibyte-characters: t

Major mode: Summary

Minor modes in effect:
  tooltip-mode: t
  mouse-wheel-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  blink-cursor-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  column-number-mode: t
  transient-mark-mode: t

Recent input:
w b u g - e m C-c C-q y M-x w r i t e - e m C-g C-h
a b u g C-x 1 M-x r e p r t o r t - e m

Recent messages:
Saving file /home/wl/Mail/draft/11...
Wrote /home/wl/Mail/draft/11
Draft is prepared
No matching alias [7 times]
Kill draft message? (y or n) y
Saving file /home/wl/Mail/draft/11...
Wrote /home/wl/Mail/draft/11
Draft was killed
Quit
Type C-x 4 C-o RET to restore the other window.

Load-path shadows:
None found.

Features:
(shadow emacsbug message format-spec rfc822 mml mml-sec mm-decode
mm-bodies mm-encode mail-parse rfc2231 mailabbrev gmm-utils mailheader
sendmail rfc2047 rfc2045 ietf-drums mm-util mail-prsvr mail-utils
apropos descr-text latexenc preview prv-emacs byte-opt tex-buf noutline
outline font-latex warnings bytecomp byte-compile cconv macroexp latex
easy-mmode edmacro kmacro tex-style cus-edit wid-edit cus-start cus-load
pp mew-varsx mew-unix cal-menu calendar cal-loaddefs mew-auth mew-config
mew-imap2 mew-imap mew-nntp2 mew-nntp mew-pop mew-smtp mew-ssl mew-ssh
mew-net mew-highlight mew-sort mew-fib mew-ext mew-refile mew-demo
mew-attach mew-draft mew-message mew-thread mew-virtual mew-summary4
mew-summary3 mew-summary2 mew-summary mew-search mew-pick mew-passwd
mew-scan mew-syntax mew-bq mew-smime mew-pgp mew-header mew-exec
mew-mark mew-mime mew-edit mew-decode mew-encode mew-cache mew-minibuf
mew-complete mew-addrbook mew-local mew-vars3 mew-vars2 mew-vars mew-env
mew-mule3 mew-mule mew-gemacs mew-key mew-func mew-blvs mew-const mew
tex advice help-fns advice-preload tex-site auto-loads quail help-mode
easymenu cjktilde disp-table time-date tooltip ediff-hook vc-hooks
lisp-float-type mwheel x-win x-dnd tool-bar dnd fontset image regexp-opt
fringe tabulated-list newcomment lisp-mode register page menu-bar
rfn-eshadow timer select scroll-bar mouse jit-lock font-lock syntax
facemenu font-core frame cham georgian utf-8-lang misc-lang vietnamese
tibetan thai tai-viet lao korean japanese hebrew greek romanian slovak
czech european ethiopic indian cyrillic chinese case-table epa-hook
jka-cmpr-hook help simple abbrev minibuffer loaddefs button faces
cus-face files text-properties overlay sha1 md5 base64 format env
code-pages mule custom widget hashtable-print-readable backquote
make-network-process dbusbind dynamic-setting system-font-setting
font-render-setting move-toolbar gtk x-toolkit x multi-tty emacs)

----Next_Part(Tue_Aug_28_07_47_20_2012_714)--
Content-Type: Application/Octet-Stream
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="emacs-problem.utf8"

9LWHngo=

----Next_Part(Tue_Aug_28_07_47_20_2012_714)----