Re: Emacs 23 character code space

From: Kenichi Handa <handa@m17n.org>
To: Eli Zaretskii <eliz@gnu.org>
Cc: emacs-devel@gnu.org
Subject: Re: Emacs 23 character code space
Date: Thu, 27 Nov 2008 10:29:50 +0900
Message-ID: <E1L5VhW-0008Rk-D7@etlken.m17n.org>
In-Reply-To: <uiqqags1p.fsf@gnu.org> (message from Eli Zaretskii on Wed, 26 Nov 2008 22:18:10 +0200)

In article <uiqqags1p.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

> > For instance, to get a glyph-code of X font, we decode a
> > character by a charset with that the font encodes glyph
> > codes.

> But that's not really "decoding", is it?  By "decoding" we usually
> mean conversion _to_ the Emacs internal representation, whereas in
> your example, we convert _from_ the internal representation to some
> other.

Oops, sorry, I confused decoding and encoding myself.  Yes,
the above is encoding.  And I made the same mistake in my
followup mail.

> To avoid confusion, I suggest to talk about "conversion" of Emacs
> characters to code points of a charset.  Do you agree?

As we have the functions encode-char and decode-char, I think it
is better to keep using the words "encoding" and "decoding"
for both kinds of conversion; i.e. character <->
(charset . code-point), and string/buffer <-> byte-sequence.

> > From: Kenichi Handa <handa@m17n.org>
[...]
> > I'll explain it a little bit more.  To decode a character
> > sequence to a byte sequence, Emacs actually does two kinds
> > of decoding as below:

As I wrote above, I made a mistake here.  So, I'll
rephrase it as below.

To convert between a character sequence and a byte sequence,
Emacs actually performs the conversion in two steps, as below.

characters --(1)-> (charset code-point) pairs --(3)-> bytes
           <-(2)--                            <-(4)--     

For the encoding of (1), Emacs uses the information of the coding
system to decide which charset to use, and then uses the
information of the selected charset to get a code point.
For the decoding of (2), Emacs uses the information of the charset
to get character codes.

For the encoding of (3) and the decoding of (4), Emacs uses
only the information of the coding system.
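
Steps (1) and (2) can be modelled with a small sketch.  This is
Python, not Emacs code, and every name in it (CHARSETS, TOY_KANA,
char_to_pair, pair_to_char) is hypothetical and made up for
illustration; the tables hold only the few characters used in this
mail.

```python
# Toy model of steps (1) and (2); not how Emacs implements them.

# Each "charset" maps characters to its own code points.
ASCII = {chr(c): c for c in range(0x80)}
TOY_KANA = {"\u3042": 0x2422}  # hypothetical one-entry charset

# The "coding system" decides which charsets may be used, in priority order.
CHARSETS = [("ascii", ASCII), ("toy-kana", TOY_KANA)]


def char_to_pair(ch, charsets):
    """Step (1): choose a charset, then look up the code point in it."""
    for name, table in charsets:
        if ch in table:
            return (name, table[ch])
    raise ValueError(f"no charset can encode {ch!r}")


def pair_to_char(pair, charsets):
    """Step (2): only the charset named in the pair is consulted."""
    name, code = pair
    table = dict(charsets)[name]
    for ch, cp in table.items():
        if cp == code:
            return ch
    raise ValueError(f"{name} has no code point {code:#x}")


pairs = [char_to_pair(ch, CHARSETS) for ch in "a\u3042x"]
round_trip = "".join(pair_to_char(p, CHARSETS) for p in pairs)
```

Here `pairs` comes out as (ascii #x61) (toy-kana #x2422) (ascii #x78),
and `round_trip` is the original string again.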

> Can you give a couple of examples, for some popular charsets, and how
> we decode bytes into characters thru these pairs of charsets and code
> points?

Ok.

Ex.1  utf-8

(1) and (2) are straightforward because the charset is
`unicode', and the Emacs character code and the code-point in
`unicode' are the same.  (3) encodes each (unicode
CODE-POINT) to a utf-8 byte sequence, and (4) does the reverse
conversion.

 "a\x3042x" -(1)-> (unicode #x61) (unicode #x3042) (unicode #x78)
            -(3)-> "#x61 #xE3 #x81 #x82 #x78"
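
The same bytes can be checked outside Emacs.  The sketch below uses
Python's built-in utf-8 codec as a stand-in; Python does not expose
(charset code-point) pairs, so step (1) is shown simply as the list
of Unicode code points.

```python
text = "a\u3042x"

# Step (1): with charset `unicode' the code point equals the character code.
code_points = [ord(ch) for ch in text]

# Step (3): each code point becomes a utf-8 byte sequence.
encoded = text.encode("utf-8")

# Step (4): the reverse conversion.
decoded = encoded.decode("utf-8")

print([hex(cp) for cp in code_points])
print(" ".join(f"#x{b:02X}" for b in encoded))
```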

Ex.2 iso-8859-2

(1) encodes each character to a code point of the charset
iso-8859-2 by the information of that charset, and (2) does
the reverse conversion.  (3) and (4) are straightforward
because the code-point sequence and the byte sequence are
the same.
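
A small illustration, again with Python's codec rather than Emacs.
The sample character U+0142 (LATIN SMALL LETTER L WITH STROKE) is my
own choice for this sketch; its iso-8859-2 code point is #xB3, and
the byte written out is that same value.

```python
text = "a\u0142x"  # U+0142 is the sample character chosen for this sketch

# Steps (1) and (3) collapse into one here: the byte values are the
# iso-8859-2 code points themselves.
encoded = text.encode("iso8859_2")
code_points = list(encoded)

# Steps (4) and (2): bytes back to characters.
decoded = encoded.decode("iso8859_2")

print([f"#x{cp:02X}" for cp in code_points])
```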

Ex.3 iso-2022-jp (japanese)

(1) at first decides which charset (among those supported by
iso-2022-jp) to use for each character, and then encodes the
character to the corresponding (charset code-point) pair.  (2)
does the decoding using the information of the charset only.  (3)
generates a byte sequence from each code-point (one byte for
a charset of dimension 1, two bytes for a charset of
dimension 2), and also inserts a proper designation byte
sequence at each charset boundary.

 "a\x3042x" -(1)-> (ascii #x61) (japanese-jisx0208 #x2422) (ascii #x78)
            -(3)-> "#x61 ESC $ B #x24 #x22 ESC ( B #x78"
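
Step (3) for this example can be sketched as follows.  This is a
simplified Python model, not the Emacs encoder: the tables
DESIGNATION and DIMENSION and the function pairs_to_bytes are
hypothetical names, and only the two charsets of the example are
handled.  The result is compared with Python's own iso2022_jp codec.

```python
# Designation sequences (ESC ( B and ESC $ B) and charset dimensions.
DESIGNATION = {"ascii": b"\x1b(B", "japanese-jisx0208": b"\x1b$B"}
DIMENSION = {"ascii": 1, "japanese-jisx0208": 2}


def pairs_to_bytes(pairs):
    """Emit code-point bytes, inserting a designation at charset boundaries."""
    out = bytearray()
    current = "ascii"  # the stream starts with ASCII designated
    for charset, code in pairs:
        if charset != current:
            out += DESIGNATION[charset]
            current = charset
        # One byte for dimension 1, two bytes for dimension 2.
        out += code.to_bytes(DIMENSION[charset], "big")
    if current != "ascii":
        out += DESIGNATION["ascii"]  # return to ASCII at the end
    return bytes(out)


pairs = [("ascii", 0x61), ("japanese-jisx0208", 0x2422), ("ascii", 0x78)]
encoded = pairs_to_bytes(pairs)
reference = "a\u3042x".encode("iso2022_jp")
print(encoded == reference)
```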

Ex.4 gb2312 (chinese)

 "a\x3042x" -(1)-> (ascii #x61) (chinese-gb2312 #x2422) (ascii #x78)
            -(3)-> "#x61 #xA4 #xA2 #x78"
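
In this example the two bytes are the code point #x2422 with the
high bit set on each byte, and no designation sequence is needed.  A
minimal Python sketch of that step (3), with a hypothetical helper
name and only the two charsets above handled, checked against
Python's gb2312 codec:

```python
def gb2312_pair_to_bytes(charset, code):
    """Step (3) for this example: ASCII as is, otherwise set the high bits."""
    if charset == "ascii":
        return bytes([code])
    hi, lo = divmod(code, 0x100)
    return bytes([hi | 0x80, lo | 0x80])


pairs = [("ascii", 0x61), ("chinese-gb2312", 0x2422), ("ascii", 0x78)]
encoded = b"".join(gb2312_pair_to_bytes(cs, cp) for cs, cp in pairs)
reference = "a\u3042x".encode("gb2312")
print(" ".join(f"#x{b:02X}" for b in encoded))
```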

> Thanks.  What confuses me is that, roughly, there's a charset in Emacs
> 23 for every coding-system, and they both have almost identical names.

But there are coding-systems that have multiple charsets.
For instance, the big5 coding-system supports both the ASCII and BIG5
charsets, and iso-2022-7bit supports a great many charsets.
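
The mixture of charsets in one coding system is visible in the
output bytes.  A short Python illustration with its big5 codec (the
sample character U+4E2D is my own choice): the ASCII character
becomes one byte and the BIG5 character becomes two.

```python
text = "a\u4e2d"  # an ASCII letter followed by a BIG5 character

# Encode each character separately to see which charset it fell into.
per_char = [ch.encode("big5") for ch in text]
lengths = [len(b) for b in per_char]

encoded = text.encode("big5")
print(lengths, " ".join(f"#x{b:02X}" for b in encoded))
```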

> For example, the code point of a-umlaut in the iso-8859-1 charset is
> exactly identical to the byte value produced by encoding that
> character with iso-8859-1 coding-system.  So I wonder why we need
> both in Emacs.  Why can't we, for example, decode bytes directly into
> Emacs characters?

Getting a code point from a byte sequence and getting a
character code from a code point are, in general, different
operations (the above example of iso-8859-1 is a rather rare case).  I
hope you understand why by seeing the above examples.
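
To make the difference concrete, here is a Python sketch (not Emacs
code).  The same code point #xB3 yields different characters
depending on the charset, while for a-umlaut in iso-8859-1 the
character code, the code point, and the byte all happen to be #xE4.

```python
byte = bytes([0xB3])

# Same byte, same code point, but the charset decides the character.
as_latin1 = byte.decode("iso8859_1")  # U+00B3 SUPERSCRIPT THREE
as_latin2 = byte.decode("iso8859_2")  # U+0142 LATIN SMALL LETTER L WITH STROKE

# The rare case: character code == code point == byte value.
a_umlaut = "\u00e4"
same_everywhere = ord(a_umlaut) == a_umlaut.encode("iso8859_1")[0]

print(hex(ord(as_latin1)), hex(ord(as_latin2)), same_everywhere)
```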

---
Kenichi Handa
handa@ni.aist.go.jp
