From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Kenichi Handa <handa@m17n.org>
Newsgroups: gmane.emacs.devel
Subject: Re: Emacs 23 character code space
Date: Thu, 27 Nov 2008 10:29:50 +0900
Message-ID: <E1L5VhW-0008Rk-D7@etlken.m17n.org>
References: <u63n7wmri.fsf@gnu.org>
	<E1KwoKX-0002Tk-Lp@etlken.m17n.org>	<E1Kwyo4-0007Vt-Ai@etlken.m17n.org>
	<umyggva8e.fsf@gnu.org>	<E1KxGRS-00087p-G2@etlken.m17n.org>
	<uhc6nutv5.fsf@gnu.org>	<E1KxhUI-0005VH-CR@etlken.m17n.org>
	<uiqqfipn3.fsf@gnu.org>	<E1L59On-0005aa-BP@etlken.m17n.org>
	<uljv7gm4j.fsf@gnu.org>	<E1L5Bwf-0006ku-Ci@etlken.m17n.org>
	<uiqqags1p.fsf@gnu.org>
NNTP-Posting-Host: lo.gmane.org
Mime-Version: 1.0 (generated by SEMI 1.14.3 - "Ushinoya")
Content-Type: text/plain; charset=US-ASCII
X-Trace: ger.gmane.org 1227749418 24889 80.91.229.12 (27 Nov 2008 01:30:18 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Thu, 27 Nov 2008 01:30:18 +0000 (UTC)
Cc: emacs-devel@gnu.org
To: Eli Zaretskii <eliz@gnu.org>
In-reply-to: <uiqqags1p.fsf@gnu.org> (message from Eli Zaretskii on Wed, 26
	Nov 2008 22:18:10 +0200)
User-Agent: SEMI/1.14.3 (Ushinoya) FLIM/1.14.2 (Yagi-Nishiguchi) APEL/10.2
	Emacs/23.0.60 (i686-pc-linux-gnu) MULE/6.0 (HANACHIRUSATO)
Xref: news.gmane.org gmane.emacs.devel:106217
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/106217>

In article <uiqqags1p.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

> > For instance, to get a glyph-code of X font, we decode a
> > character by a charset with that the font encodes glyph
> > codes.

> But that's not really "decoding", is it?  By "decoding" we usually
> mean conversion _to_ the Emacs internal representation, whereas in
> your example, we convert _from_ the internal representation to some
> other.

Oops, sorry, I confused decoding and encoding myself.  Yes,
the above is encoding.  And I made the same mistake in my
follow-up mail.

> To avoid confusion, I suggest to talk about "conversion" of Emacs
> characters to code points of a charset.  Do you agree?

As we have the functions encode-char and decode-char, I think
it is better to keep using the words "encoding" and "decoding"
for both kinds of conversion, i.e. character <->
(charset . code-point), and string/buffer <-> byte-sequence.

> > From: Kenichi Handa <handa@m17n.org>
[...]
> > I'll explain it a little bit more.  To decode a character
> > sequence to a byte sequence, Emacs actually does two kinds
> > of decoding as below:

As I wrote above, I made a mistake here.  So I'll
rephrase it as below.

To convert between a character sequence and a byte sequence,
Emacs actually performs two conversion steps, as below.


characters --(1)-> (charset code-point) pairs --(3)-> bytes
           <-(2)--                            <-(4)--     

For the encoding of (1), Emacs uses the information of the
coding system to decide which charset to use, and then uses
the information of the selected charset to get a code point.
For the decoding of (2), Emacs uses the information of the
charset to get character codes.

For the encoding of (3) and the decoding of (4), Emacs uses
only the information of the coding system.
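
As a rough illustration of steps (1) and (2) only, here is a toy model in Python (not Emacs Lisp, and not how Emacs implements it).  The names CHARSETS, to_pairs and from_pairs and the one-entry "toy-kana" charset are all made up for this sketch; the priority list stands in for the coding system's choice of charsets.

```python
# Toy model of steps (1) and (2); NOT Emacs's actual implementation.
# Every name below is hypothetical and exists only for illustration.

# Per-charset tables mapping characters to code points.
CHARSETS = {
    "ascii": {chr(c): c for c in range(0x80)},
    "toy-kana": {"\u3042": 0x2422},  # made-up one-entry charset
}


def to_pairs(text, charset_priority):
    """Step (1): choose a charset per character, then look up its code point.

    charset_priority plays the role of the coding system, which decides
    which charsets may be used and in what order they are tried.
    """
    pairs = []
    for ch in text:
        for name in charset_priority:
            if ch in CHARSETS[name]:
                pairs.append((name, CHARSETS[name][ch]))
                break
        else:
            raise ValueError(f"no charset can encode {ch!r}")
    return pairs


def from_pairs(pairs):
    """Step (2): only charset information is needed to get characters back."""
    reverse = {
        name: {cp: ch for ch, cp in table.items()}
        for name, table in CHARSETS.items()
    }
    return "".join(reverse[name][cp] for name, cp in pairs)


pairs = to_pairs("a\u3042x", ["ascii", "toy-kana"])
print(pairs)
print(from_pairs(pairs) == "a\u3042x")
```

Steps (3) and (4) are deliberately left out here, because they depend on the coding system alone.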

> Can you give a couple of examples, for some popular charsets, and how
> we decode bytes into characters thru these pairs of charsets and code
> points?

Ok.

Ex.1  utf-8

(1) and (2) are straightforward because the charset is
`unicode', and the Emacs character code and the code point in
`unicode' are the same.  (3) encodes each (unicode
CODE-POINT) to a utf-8 byte sequence, and (4) does the reverse
conversion.

 "a\x3042x" -(1)-> (unicode #x61) (unicode #x3042) (unicode #x78)
            -(3)-> "#x61 #xE3 #x81 #x82 #x78"
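
The same example can be reproduced outside Emacs with a small Python sketch.  It assumes that Python's ord() may stand in for step (1), since a Python string is also a sequence of Unicode code points; utf8_bytes is a hypothetical helper written for this mail, it does not reject surrogates, and it is only checked against Python's own utf-8 codec.

```python
def utf8_bytes(cp):
    """Step (3) for utf-8: turn one Unicode code point into bytes.

    Illustrative helper only; it does no validation of its argument.
    """
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


text = "a\u3042x"
# Step (1): the code point in `unicode' is just the character code.
pairs = [("unicode", ord(ch)) for ch in text]
# Step (3): each code point becomes one to four bytes.
encoded = b"".join(utf8_bytes(cp) for _, cp in pairs)

print([(name, hex(cp)) for name, cp in pairs])
print(encoded.hex(" "))
```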

Ex.2 iso-8859-2

(1) encodes each character to a code point of the charset
iso-8859-2 using the information of that charset, and (2) does
the reverse conversion.  (3) and (4) are straightforward
because the code-point sequence and the byte sequence are
the same.
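
A small Python sketch of this case follows.  It assumes that Python's "iso8859_2" codec can stand in for the charset table of step (1); to_code_point is a hypothetical helper, and the character U+0141 (L with stroke) is my own choice of example, not taken from the discussion.

```python
def to_code_point(ch):
    """Step (1): look up the iso-8859-2 code point of one character.

    Python's iso8859_2 codec is used here as a stand-in for the
    charset table; this is an assumption made for illustration.
    """
    return ch.encode("iso8859_2")[0]


ch = "\u0141"  # LATIN CAPITAL LETTER L WITH STROKE
code_point = to_code_point(ch)

# Step (3) is trivial: the code-point sequence is the byte sequence.
encoded = bytes([code_point])

# Step (1) is where the real work is: the character code and the
# code point in the charset are different numbers.
print(hex(ord(ch)), hex(code_point), encoded)
```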

Ex.3 iso-2022-jp (japanese)

(1) first decides which charset (among those supported by
iso-2022-jp) to use for each character, and then encodes the
character to the corresponding (charset code-point) pair.  (2)
does the decoding using the information of the charset only.  (3)
generates a byte sequence from each code point (one byte for
a charset of dimension 1, two bytes for a charset of
dimension 2), and also inserts the proper designation byte
sequence at each charset boundary.

 "a\x3042x" -(1)-> (ascii #x61) (japanese-jisx0208 #x2422) (ascii #x78)
            -(3)-> "#x61 ESC $ B #x24 #x22 ESC ( B #x78"
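
Step (3) of this example can be sketched in Python as below.  This is a simplified model, not the Emacs code: it knows only the two charsets used in the example, and the names DESIGNATION, DIMENSION and pairs_to_bytes are made up for this mail.  The result is compared with Python's own iso2022_jp codec.

```python
ESC = b"\x1b"

# Designation sequences and dimensions for the two charsets in the example.
DESIGNATION = {
    "ascii": ESC + b"(B",
    "japanese-jisx0208": ESC + b"$B",
}
DIMENSION = {"ascii": 1, "japanese-jisx0208": 2}


def pairs_to_bytes(pairs):
    """Step (3): emit code-point bytes, designating at charset boundaries."""
    out = bytearray()
    current = "ascii"  # the text starts with ASCII designated
    for charset, cp in pairs:
        if charset != current:
            out += DESIGNATION[charset]
            current = charset
        # One byte for dimension 1, two bytes for dimension 2.
        out += cp.to_bytes(DIMENSION[charset], "big")
    if current != "ascii":
        out += DESIGNATION["ascii"]  # go back to ASCII at the end
    return bytes(out)


pairs = [("ascii", 0x61), ("japanese-jisx0208", 0x2422), ("ascii", 0x78)]
encoded = pairs_to_bytes(pairs)
print(encoded)
print(encoded == "a\u3042x".encode("iso2022_jp"))
```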

Ex.4 gb2312 (chinese)

 "a\x3042x" -(1)-> (ascii #x61) (chinese-gb2312 #x2422) (ascii #x78)
            -(3)-> "#x61 #xA4 #xA2 #x78"
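
For comparison with Ex.3, here is a Python sketch of step (3) for this coding system, under the assumption that each byte of a two-byte chinese-gb2312 code point simply gets its high bit set while ASCII code points pass through unchanged.  gb2312_pairs_to_bytes is a hypothetical name; the result is compared with Python's gb2312 codec.

```python
def gb2312_pairs_to_bytes(pairs):
    """Step (3) for gb2312: no designation sequences are needed.

    ASCII code points are emitted as they are; each byte of a
    two-byte code point gets its high bit set (cp | 0x8080).
    """
    out = bytearray()
    for charset, cp in pairs:
        if charset == "ascii":
            out.append(cp)
        else:
            out += (cp | 0x8080).to_bytes(2, "big")
    return bytes(out)


pairs = [("ascii", 0x61), ("chinese-gb2312", 0x2422), ("ascii", 0x78)]
encoded = gb2312_pairs_to_bytes(pairs)
print(encoded.hex(" "))
print(encoded == "a\u3042x".encode("gb2312"))
```

Note that the code point #x2422 happens to be the same number as in Ex.3, but the charset differs and so do the resulting bytes.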

> Thanks.  What confuses me is that, roughly, there's a charset in Emacs
> 23 for every coding-system, and they both have almost identical names.

But there are coding systems that have multiple charsets.
For instance, the big5 coding system supports both the ASCII and
BIG5 charsets, and iso-2022-7bit supports a great many charsets.

> For example, the code point of a-umlaut in the iso-8859-1 charset is
> exactly identical to the byte value produced by encoding that
> character with iso-8859-1 coding-system.  So I wonder why we need
> both in Emacs.  Why can't we, for example, decode bytes directly into
> Emacs characters?

Getting a code point from a byte sequence and getting a
character code from a code point are, in general, different
operations (the above example of iso-8859-1 is a rather rare
case).  I hope the examples above show why.

---
Kenichi Handa
handa@ni.aist.go.jp