From: Kenichi Handa
Newsgroups: gmane.emacs.devel
Subject: Re: Possible UTF-8 CJK Regressions in Terminal Emulators
Date: Thu, 10 Jun 2004 09:20:33 +0900 (JST)
Message-ID: <200406100020.JAA12785@etlken.m17n.org>
References: <1077643915.12919.2.camel@duende>
 <1077682436.28482.9.camel@duende>
 <200403010815.RAA14365@etlken.m17n.org>
 <200404071230.VAA25159@etlken.m17n.org>
 <200404091128.UAA02120@etlken.m17n.org>
 <200406071227.VAA06216@etlken.m17n.org>
 <20040607123615.GA29450@fencepost>
 <200406071300.WAA06332@etlken.m17n.org>
 <200406090737.QAA11090@etlken.m17n.org>
To: monnier@iro.umontreal.ca
Cc: mariano@gnome.org, alexander.winston@comcast.net, d.love@dl.ac.uk,
 emacs-devel@gnu.org, danilo@gnome.org, miles@gnu.org
In-reply-to: (message from Stefan Monnier on 09 Jun 2004 05:38:30 -0400)

In article , Stefan Monnier writes:

> > As surrogate pairs were not handled well by the UTF-16 converter,
> > I've just fixed that too (not yet installed, I'm now adding
> > comments to the code).  Untranslatable characters are decoded
> > into UTF-8 form represented by the sequence of
> > eight-bit-graphic/control characters (the same way as UTF-8
> > decoding, thus we can use utf-8-post-read-conversion).  The
> > UTF-16 encoder encodes such a sequence back to the original
> > UTF-16 form.  So, now the UTF-16 support is at the same
> > level as UTF-8.

> Does that mean that some sequences of eight-bit-graphic/control are not
> encoded into the corresponding raw bytes?

No.  But that's only the case when we encode a modified text
(i.e. eight-bit-graphic/control chars are added/modified after we
decoded a source).

> If so, that makes me a bit uneasy, since those special chars were
> introduced specifically to handle things like binary input or
> bad-byte-sequences and make sure that we at least preserve the raw bytes in
> those cases.

As long as we encode a non-modified text that is generated by
decoding a source, we can preserve the byte sequence even if the
original source contains a bad-byte-sequence (for the case of UTF-8,
I found a case that didn't work as expected and fixed it).

---
Ken'ichi HANDA
handa@m17n.org
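
The round-trip property described above (decoding a source and encoding the unmodified result gives back the original bytes, bad sequences included) can be illustrated outside Emacs. The following is only an analogy, a minimal sketch in Python: its `surrogateescape` error handler plays a role loosely similar to the eight-bit-graphic/control characters, but it is not the Emacs mechanism, and the helper name `roundtrip_utf8` is made up for this illustration.

```python
# Analogy only: Python's "surrogateescape" handler stands in for the
# eight-bit-graphic/control characters; this is not Emacs code.

def roundtrip_utf8(raw: bytes) -> bytes:
    """Decode bytes as UTF-8, keeping undecodable bytes as placeholder
    characters, then encode the unmodified text back to bytes."""
    text = raw.decode("utf-8", errors="surrogateescape")
    return text.encode("utf-8", errors="surrogateescape")

# A stray 0xFF byte is never valid in UTF-8, yet it survives the round trip.
bad_utf8 = b"abc\xffdef"
assert roundtrip_utf8(bad_utf8) == bad_utf8

# A character outside the BMP (here U+1D11E) is a surrogate pair in UTF-16.
clef_utf16 = "\U0001D11E".encode("utf-16-be")
assert clef_utf16 == b"\xd8\x34\xdd\x1e"

# Decoding and re-encoding the unmodified text restores the same pair.
assert clef_utf16.decode("utf-16-be").encode("utf-16-be") == clef_utf16
```

As in the discussion, the guarantee here covers only text that has not been modified between decoding and encoding.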