From: Kenichi Handa
Newsgroups: gmane.emacs.devel
Subject: Re: UCS-2BE
Date: Fri, 01 Sep 2006 10:19:34 +0900
References: <878xl5x4lr.fsf@jurta.org> <44F6A74A.9040708@gnu.org> <44F6BC5B.8010504@gnu.org> <87ac5ko50j.fsf@jurta.org>
Cc: schwab@suse.de, jasonr@gnu.org, emacs-devel@gnu.org
Original-To: Juri Linkov
In-reply-to: <87ac5ko50j.fsf@jurta.org> (message from Juri Linkov on Fri, 01 Sep 2006 02:32:44 +0300)

In article <87ac5ko50j.fsf@jurta.org>, Juri Linkov writes:

> `UCS-2' is the fixed-length encoding of the BMP.  `UCS-2BE' is
> a big-endian version of the UCS-2 encoding without using a BOM.
> So as actually UCS-2 is a BMP subset of UTF-16, UCS-2BE is a BMP
> subset of UTF-16BE (and UCS-2LE is a BMP subset of UTF-16LE).

Where did you get that info?

The word "encoding" is ambiguous here.  There are "CEF (Character Encoding Form)" and "CES (Character Encoding Scheme)".  Unicode says (see Glossary):

  Character Encoding Form: Mapping from a character set definition to the actual code units used to represent the data.

  Character Encoding Scheme: A character encoding form plus byte serialization. ...

UCS-XXX are CEF, and UTF-XXX are CES.  So, UCS-XXX are not appropriate label names for specifying how to byte-serialize characters (i.e. on saving characters in a file).  At least, that is the official definition in Unicode.

And, as you see now, there is a contradiction in the term "UCS-2BE" because "BE" is information about byte-serialization.
But the term "UCS-2BE" itself is not defined in Unicode.  So, there are two possibilities:

(1) It's just a mis-label of something.
(2) It's defined somewhere else.

Which is the case?

> The encodings `UCS-2' and `UCS-2BE' are implemented in iconv
> (http://www.gnu.org/software/libiconv/), so you could look
> at the implementation of UCS-2BE:

Does it mean that it's an invention of iconv to use those names as CES?

> Does the Emacs implementation of UTF-16 output the BOM?

Yes.

> > (at least, we should not select it by select-safe-coding-system on
> > saving a buffer that contains non-BMP characters).

> What do you think is the right way to deal with non-BMP characters
> when the user will try to save a UTF-16(BE) buffer in the UCS-2(BE)
> encoding?

It depends on how UCS-2BE is defined.  If we follow the implementation of iconv (and if the buffer contains non-BMP characters), we should ask the user to select a proper coding system.

---
Kenichi Handa
handa@m17n.org
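
To make the last point concrete, here is a minimal Python sketch (not Emacs code, and not taken from iconv's source).  It assumes the reading of UCS-2BE quoted above: big-endian 16-bit units, no BOM, BMP characters only.  The names `is_bmp_only` and `encode_ucs2be` are hypothetical helpers made up for this illustration; the only library calls used are Python's standard `str.encode` with the "utf-16-be" codec and the `codecs` BOM constants.

```python
import codecs


def is_bmp_only(text: str) -> bool:
    """Return True if every character lies in the BMP (U+0000..U+FFFF)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


def encode_ucs2be(text: str) -> bytes:
    """Hypothetical helper: serialize text as UCS-2BE under the assumed
    definition (big-endian 16-bit units, no BOM, BMP only).

    For BMP-only text the bytes are the same as UTF-16BE, so the
    standard "utf-16-be" codec is reused after the range check.
    """
    if not is_bmp_only(text):
        # Mirrors the suggestion in the message: do not silently save;
        # make the caller pick a coding system that can represent the text.
        raise ValueError("non-BMP character found; select another coding system")
    return text.encode("utf-16-be")


# BMP-only text: two bytes per character, most significant byte first, no BOM.
bmp_bytes = encode_ucs2be("A\u3042")

# The same non-BMP character needs a surrogate pair (four bytes) in UTF-16BE,
# which a fixed-length 16-bit encoding cannot express.
non_bmp_bytes = "\U0001F600".encode("utf-16-be")

# Python's plain "utf-16" codec, by contrast, writes a BOM first.
starts_with_bom = "A".encode("utf-16")[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
```

Raising an error instead of dropping or replacing characters corresponds to asking the user to choose a proper coding system; a real implementation in Emacs would go through its own coding-system machinery rather than anything like this helper.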