From: Kenichi Handa
Newsgroups: gmane.emacs.devel
Subject: Re: Emacs 23 character code space
Date: Thu, 27 Nov 2008 10:29:50 +0900
To: Eli Zaretskii
Cc: emacs-devel@gnu.org
In-reply-to: (message from Eli Zaretskii on Wed, 26 Nov 2008 22:18:10 +0200)

Eli Zaretskii writes:

> > For instance, to get a glyph-code of X font, we decode a
> > character by a charset with that the font encodes glyph
> > codes.

> But that's not really "decoding", is it?  By "decoding" we usually
> mean conversion _to_ the Emacs internal representation, whereas in
> your example, we convert _from_ the internal representation to some
> other.

Oops, sorry, I confused decoding and encoding myself.  Yes, the
above is encoding.  And I made the same mistake in my follow-up
mail.

> To avoid confusion, I suggest to talk about "conversion" of Emacs
> characters to code points of a charset.  Do you agree?

As we have the functions encode-char and decode-char, I think it
is better to keep using the words "encoding" and "decoding" for
both kinds of conversion; i.e. character <-> (charset . code-point),
and string/buffer <-> byte-sequence.

> > From: Kenichi Handa
[...]
> > I'll explain it a little bit more.  To decode a character
> > sequence to a byte sequence, Emacs actually does two kinds
> > of decoding as below:

As I wrote above, I made a mistake here.  So, I'll rephrase it
as below.

To convert between a character sequence and a byte sequence,
Emacs actually does the conversion in two steps, as below.

  characters --(1)-> (charset code-point) pairs --(3)-> bytes
             <-(2)--                            <-(4)--

For the encoding of (1), Emacs uses the information of the coding
system to decide which charset to use, and then uses the
information of the selected charset to get a code point.  For the
decoding of (2), Emacs uses the information of the charset to get
character codes.  For the encoding of (3) and the decoding of (4),
Emacs uses only the information of the coding system.

> Can you give a couple of examples, for some popular charsets, and how
> we decode bytes into characters thru these pairs of charsets and code
> points?

Ok.

Ex.1 utf-8

(1) and (2) are straightforward because the charset is `unicode'
and the Emacs character code and the code point in `unicode' are
the same.  (3) encodes each (unicode CODE-POINT) pair into a utf-8
byte sequence; (4) does the reverse conversion.

  "a\x3042x" -(1)-> (unicode #x61) (unicode #x3042) (unicode #x78)
             -(3)-> "#x61 #xE3 #x81 #x82 #x78"

Ex.2 iso-8859-2

(1) encodes each character to a code point of the charset
iso-8859-2 using the information of that charset, and (2) does the
reverse conversion.  (3) and (4) are straightforward because the
code-point sequence and the byte sequence are the same.

Ex.3 iso-2022-jp (japanese)

(1) first decides which charset (among those supported by
iso-2022-jp) to use for each character, and then encodes the
character to the corresponding (charset code-point) pair.  (2)
does the decoding using the information of the charset only.  (3)
generates a byte sequence from each code point (one byte for a
charset of dimension 1, two bytes for a charset of dimension 2),
and also inserts the proper designation byte sequence at each
charset boundary.

  "a\x3042x" -(1)-> (ascii #x61) (japanese-jisx0208 #x2422) (ascii #x78)
             -(3)-> "#x61 ESC $ B #x24 #x22 ESC ( B #x78"

Ex.4 gb2312 (chinese)

  "a\x3042x" -(1)-> (ascii #x61) (chinese-gb2312 #x2422) (ascii #x78)
             -(3)-> "#x61 #xA4 #xA2 #x78"

> Thanks.  What confuses me is that, roughly, there's a charset in Emacs
> 23 for every coding-system, and they both have almost identical names.

But there are coding systems that have multiple charsets.  For
instance, the big5 coding system supports both the ASCII and BIG5
charsets, and iso-2022-7bit supports many charsets.

> For example, the code point of a-umlaut in the iso-8859-1 charset is
> exactly identical to the byte value produced by encoding that
> character with iso-8859-1 coding-system.  So I wonder why we need
> both in Emacs.  Why can't we, for example, decode bytes directly into
> Emacs characters?

In general, getting a code point from a byte sequence and getting
a character code from a code point are different operations (the
iso-8859-1 example above is a rather rare case).  I hope the
examples above make it clear why.

---
Kenichi Handa
handa@ni.aist.go.jp