From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Kenichi Handa <handa@m17n.org>
Newsgroups: gmane.emacs.devel
Subject: Re: Emacs 23 character code space
Date: Mon, 03 Nov 2008 21:45:20 +0900
Message-ID: <E1Kwyo4-0007Vt-Ai@etlken.m17n.org>
References: <u63n7wmri.fsf@gnu.org> <E1KwoKX-0002Tk-Lp@etlken.m17n.org>
NNTP-Posting-Host: lo.gmane.org
Mime-Version: 1.0 (generated by SEMI 1.14.3 - "Ushinoya")
Content-Type: text/plain; charset=US-ASCII
X-Trace: ger.gmane.org 1225716383 24806 80.91.229.12 (3 Nov 2008 12:46:23 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Mon, 3 Nov 2008 12:46:23 +0000 (UTC)
Cc: eliz@gnu.org, emacs-devel@gnu.org
To: Kenichi Handa <handa@m17n.org>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Mon Nov 03 13:47:22 2008
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([199.232.76.165])
	by lo.gmane.org with esmtp (Exim 4.50)
	id 1Kwypr-0004w4-5C
	for ged-emacs-devel@m.gmane.org; Mon, 03 Nov 2008 13:47:11 +0100
Original-Received: from localhost ([127.0.0.1]:60859 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1Kwyok-0001Xp-E5
	for ged-emacs-devel@m.gmane.org; Mon, 03 Nov 2008 07:46:02 -0500
Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1KwyoE-00014W-7f
	for emacs-devel@gnu.org; Mon, 03 Nov 2008 07:45:30 -0500
Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1KwyoD-00013h-Eq
	for emacs-devel@gnu.org; Mon, 03 Nov 2008 07:45:29 -0500
Original-Received: from [199.232.76.173] (port=55275 helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1KwyoC-00013Z-Ry
	for emacs-devel@gnu.org; Mon, 03 Nov 2008 07:45:28 -0500
Original-Received: from mx1.aist.go.jp ([150.29.246.133]:40607)
	by monty-python.gnu.org with esmtp (Exim 4.60)
	(envelope-from <handa@m17n.org>)
	id 1Kwyo9-0008VB-9i; Mon, 03 Nov 2008 07:45:25 -0500
Original-Received: from rqsmtp2.aist.go.jp (rqsmtp2.aist.go.jp [150.29.254.123])
	by mx1.aist.go.jp  with ESMTP id mA3CjLjb021613;
	Mon, 3 Nov 2008 21:45:21 +0900 (JST) env-from (handa@m17n.org)
Original-Received: from smtp4.aist.go.jp
	by rqsmtp2.aist.go.jp  with ESMTP id mA3CjLlf004948;
	Mon, 3 Nov 2008 21:45:21 +0900 (JST) env-from (handa@m17n.org)
Original-Received: by smtp4.aist.go.jp  with ESMTP id mA3CjKif013547;
	Mon, 3 Nov 2008 21:45:20 +0900 (JST) env-from (handa@m17n.org)
Original-Received: from handa by etlken.m17n.org with local (Exim 4.69)
	(envelope-from <handa@m17n.org>)
	id 1Kwyo4-0007Vt-Ai; Mon, 03 Nov 2008 21:45:20 +0900
In-reply-to: <E1KwoKX-0002Tk-Lp@etlken.m17n.org> (message from Kenichi Handa
	on Mon, 03 Nov 2008 10:34:09 +0900)
User-Agent: SEMI/1.14.3 (Ushinoya) FLIM/1.14.2 (Yagi-Nishiguchi) APEL/10.2
	Emacs/23.0.60 (i686-pc-linux-gnu) MULE/6.0 (HANACHIRUSATO)
X-detected-operating-system: by monty-python.gnu.org: Solaris 9
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:105294
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/105294>

In article <E1KwoKX-0002Tk-Lp@etlken.m17n.org>, Kenichi Handa <handa@m17n.org> writes:

> I'm now in Vietnam, and the Internet connection is very bad,
> so here's a very short reply.

I moved to another hotel, and the Internet connection is a
little bit better here. :-)

> In article <u63n7wmri.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

> > This fragment from etc/NEWS:
> >     The character code space is now 0x0..0x3FFFFF with no gap.
> >     Characters of code 0x0..0x10FFFF are Unicode characters of the same code points.
> >     Characters of code 0x3FFF80..0x3FFFFF are raw 8-bit bytes.

> > seems to contradict itself: it says there's ``no gap'', but the codes
> > between 0x110000 and 0x3FFF7F do constitute a gap, don't they?

> Those are for character codes not unified with Unicode.

I tried to rewrite nonascii.texi to clarify things.  I have
finished up to the "Character Codes" section, as attached.
What do you think about it?

---
Kenichi Handa
handa@ni.aist.go.jp

@c -*-texinfo-*-
@c This is part of the GNU Emacs Lisp Reference Manual.
@c Copyright (C) 1998, 1999, 2001, 2002, 2003, 2004,
@c   2005, 2006, 2007, 2008  Free Software Foundation, Inc.
@c See the file elisp.texi for copying conditions.
@setfilename ../../info/characters
@node Non-ASCII Characters, Searching and Matching, Text, Top
@chapter Non-@acronym{ASCII} Characters
@cindex multibyte characters
@cindex characters, multi-byte
@cindex non-@acronym{ASCII} characters

  This chapter covers the special issues relating to non-@acronym{ASCII}
characters and how they are stored in strings and buffers.

@menu
* Text Representations::    Unibyte and multibyte representations
* Converting Representations::  Converting unibyte to multibyte and vice versa.
* Selecting a Representation::  Treating a byte sequence as unibyte or multi.
* Character Codes::         How unibyte and multibyte relate to
                                codes of individual characters.
* Character Sets::          The space of possible character codes
                                is divided into various character sets.
* Chars and Bytes::         More information about multibyte encodings.
* Splitting Characters::    Converting a character to its byte sequence.
* Scanning Charsets::       Which character sets are used in a buffer?
* Translation of Characters::   Translation tables are used for conversion.
* Coding Systems::          Coding systems are conversions for saving files.
* Input Methods::           Input methods allow users to enter various
                                non-ASCII characters without special keyboards.
* Locales::                 Interacting with the POSIX locale.
@end menu

@node Text Representations
@section Text Representations
@cindex text representations

  Emacs has two @dfn{text representations}---two ways to represent
text in a string or buffer.  These are called @dfn{unibyte} and
@dfn{multibyte}.  Each string, and each buffer, uses one of these two
representations to store a sequence of Emacs characters.  Emacs
classifies characters into three kinds: @acronym{ASCII} characters,
non-@acronym{ASCII} characters, and 8-bit characters.  8-bit characters
correspond to raw bytes 128 through 255.  @xref{Character Codes}, for
details.

@cindex unibyte text
@cindex unibyte character
  In unibyte representation, each character occupies one byte and
therefore the possible character codes range from 0 to 255.  Codes 0
through 127 are @acronym{ASCII} characters; the codes from 128 through 255
are 8-bit characters.  Non-@acronym{ASCII} characters cannot be stored
in unibyte text.  We call a character in unibyte text a @dfn{unibyte
character}.

@cindex leading code
@cindex multibyte text
@cindex multibyte character

  In multibyte representation, a character may occupy more than one
byte, and as a result, the full range of Emacs character codes
(#x0..#x3FFFFF) can be stored.  @acronym{ASCII} characters occupy one
byte, non-@acronym{ASCII} characters occupy two to five bytes (the
first byte is in the range #xC2 through #xF8, and the remaining bytes
are in the range #x80 through #xBF), and 8-bit characters occupy two
bytes (the first byte is #xC0 or #xC1, and the second byte is in the
range #x80 through #xBF).  This representation is the same as UTF-8,
with extensions for non-Unicode characters and 8-bit characters.
Emacs guarantees that a byte sequence that does not fit one of the
above patterns never appears in multibyte text.

  In a buffer, the buffer-local value of the variable
@code{enable-multibyte-characters} specifies the representation used.
The representation for a string is determined and recorded in the string
when the string is constructed.

@defvar enable-multibyte-characters
This variable specifies the current buffer's text representation.
If it is non-@code{nil}, the buffer contains multibyte text; otherwise,
it contains unibyte text.

You cannot set this variable directly; instead, use the function
@code{set-buffer-multibyte} to change a buffer's representation.
@end defvar

@defvar default-enable-multibyte-characters
This variable's value is entirely equivalent to @code{(default-value
'enable-multibyte-characters)}, and setting this variable changes that
default value.  Setting the local binding of
@code{enable-multibyte-characters} in a specific buffer is not allowed,
but changing the default value is supported, and it is a reasonable
thing to do, because it has no effect on existing buffers.

The @samp{--unibyte} command line option does its job by setting the
default value to @code{nil} early in startup.
@end defvar

@defun position-bytes position
Return the byte-position corresponding to buffer position
@var{position} in the current buffer.  This is 1 at the start of the
buffer, and counts upward in bytes.  If @var{position} is out of
range, the value is @code{nil}.
@end defun

@defun byte-to-position byte-position
Return the buffer position corresponding to byte-position
@var{byte-position} in the current buffer.  If @var{byte-position} is
out of range, the value is @code{nil}.  If @var{byte-position} is not
at a character boundary (which can happen only in a multibyte buffer),
the value is the buffer position of the character that occupies
@var{byte-position}.
@end defun
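
The following sketch illustrates how the two functions relate.  It
assumes a multibyte temporary buffer holding one two-byte character
(code #x3B1) followed by one @acronym{ASCII} character; the buffer
contents are chosen only for illustration.

@example
(with-temp-buffer
  (set-buffer-multibyte t)
  ;; Character 1 occupies bytes 1-2, character 2 occupies byte 3.
  (insert #x3B1 "b")
  (list (position-bytes 2)       ; byte position of character 2
        (byte-to-position 3)     ; character that starts at byte 3
        (byte-to-position 2)     ; byte 2 is inside character 1
        (position-bytes 10)))    ; out of range
     @result{} (3 2 1 nil)
@end example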

@defun multibyte-string-p string
Return @code{t} if @var{string} is a multibyte string.
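
For instance (a small sketch; the strings are arbitrary examples):

@example
;; A string literal containing only ASCII characters is unibyte.
(multibyte-string-p "abc")
     @result{} nil
;; A string holding a non-ASCII character must be multibyte.
(multibyte-string-p (char-to-string #x3B1))
     @result{} t
;; The same ASCII text can also be stored as a multibyte string.
(multibyte-string-p (string-to-multibyte "abc"))
     @result{} t
@end example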
@end defun

@defun string-bytes string
@cindex string, number of bytes
This function returns the number of bytes in @var{string}.
If @var{string} is a multibyte string, this can be greater than
@code{(length @var{string})}.
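
As a rough illustration of the byte lengths described earlier in this
section (the characters are arbitrary examples: an @acronym{ASCII}
character, code #x3B1, which takes two bytes, and code #x3042, which
takes three):

@example
(let ((s (string ?a #x3B1 #x3042)))
  (list (length s) (string-bytes s)))
     @result{} (3 6)
;; An 8-bit character occupies two bytes in a multibyte string.
(string-bytes (char-to-string #x3FFF80))
     @result{} 2
;; In a unibyte string, bytes and characters coincide.
(string-bytes "abc")
     @result{} 3
@end example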
@end defun

@node Converting Representations
@section Converting Text Representations

  Emacs can convert unibyte text to multibyte; it can also convert
multibyte text to unibyte provided that the multibyte text contains
only @acronym{ASCII} and 8-bit characters.  In
general these conversions happen when inserting text into a buffer, or
when putting text from several strings together in one string.  You can
also explicitly convert a string's contents to either representation.

  Emacs chooses the representation for a string based on the text that
it is constructed from.  The general rule is to convert unibyte text to
multibyte text when combining it with other multibyte text, because the
multibyte representation is more general and can hold whatever
characters the unibyte text has.

  When inserting text into a buffer, Emacs converts the text to the
buffer's representation, as specified by
@code{enable-multibyte-characters} in that buffer.  In particular, when
you insert multibyte text into a unibyte buffer, Emacs converts the text
to unibyte, even though this conversion cannot in general preserve all
the characters that might be in the multibyte text.  The other natural
alternative, to convert the buffer contents to multibyte, is not
acceptable because the buffer's representation is a choice made by the
user that cannot be overridden automatically.

  Converting unibyte text to multibyte text leaves @acronym{ASCII} characters
unchanged, and converts 8-bit characters (codes 128 through 255) to
the corresponding representation for multibyte text.

  Converting multibyte text to unibyte is simpler: it discards all but
the low 8 bits of each character code.  It effectively converts all
@acronym{ASCII} and 8-bit characters to the corresponding unibyte
representation, but loses information for non-@acronym{ASCII}
characters.  Converting unibyte text to multibyte and back to unibyte
reproduces the original unibyte text.

  The next two functions either return the argument @var{string}, or a
newly created string with no text properties.

@defun string-to-multibyte string
This function returns a multibyte string containing the same sequence
of characters as @var{string}.  If @var{string} is a multibyte string,
it is returned unchanged.
@end defun

@defun string-to-unibyte string
This function returns a unibyte string containing the same sequence of
characters as @var{string}.  It signals an error if @var{string}
contains a non-@acronym{ASCII} character.  If @var{string} is a
unibyte string, it is returned unchanged.
@end defun
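
The round trip described above can be sketched as follows.  The
one-byte unibyte string @code{"\351"} (byte 233) is an arbitrary
example; its multibyte counterpart is the 8-bit character whose code
is #x3FFF00 plus 233.

@example
(aref "\351" 0)
     @result{} 233
(aref (string-to-multibyte "\351") 0)
     @result{} 4194281               ; #x3FFFE9
;; Unibyte -> multibyte -> unibyte reproduces the original.
(equal (string-to-unibyte (string-to-multibyte "\351")) "\351")
     @result{} t
;; A non-ASCII character cannot be converted; this signals an error.
;; (string-to-unibyte (char-to-string #x3B1))
@end example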

@defun multibyte-char-to-unibyte char
This function converts the multibyte character @var{char} to a unibyte
character.  If @var{char} is a non-@acronym{ASCII} character, the
value is -1.
@end defun

@defun unibyte-char-to-multibyte char
This function converts the unibyte character @var{char} to a multibyte
character.
@end defun
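
A brief sketch of these two functions, assuming the conversion rule
stated in this section (a unibyte code in the range 128 through 255
maps to the 8-bit character with the same low 8 bits); the codes are
arbitrary examples:

@example
(unibyte-char-to-multibyte 233)
     @result{} 4194281               ; #x3FFFE9
(multibyte-char-to-unibyte 4194281)
     @result{} 233
;; A non-ASCII character has no unibyte equivalent.
(multibyte-char-to-unibyte #x3B1)
     @result{} -1
@end example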

@node Selecting a Representation
@section Selecting a Representation

  Sometimes it is useful to examine an existing buffer or string as
multibyte when it was unibyte, or vice versa.

@defun set-buffer-multibyte multibyte
Set the representation type of the current buffer.  If @var{multibyte}
is non-@code{nil}, the buffer becomes multibyte.  If @var{multibyte}
is @code{nil}, the buffer becomes unibyte.

This function leaves the buffer contents unchanged when viewed as a
sequence of bytes.  As a consequence, it can change the contents
viewed as characters; a sequence of three bytes which is treated as
one character in multibyte representation will count as three
characters in unibyte representation.  8-bit characters are an
exception.  They are represented by one byte in a unibyte buffer, but
when the buffer is set to multibyte, they are converted to two-byte
sequences, and vice versa.

This function sets @code{enable-multibyte-characters} to record which
representation is in use.  It also adjusts various data in the buffer
(including overlays, text properties and markers) so that they cover the
same text as they did before.

You cannot use @code{set-buffer-multibyte} on an indirect buffer,
because indirect buffers always inherit the representation of the
base buffer.
@end defun
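
The effect on the character count can be sketched like this, using a
temporary buffer that holds a single three-byte character (code
#x3042, chosen only for illustration):

@example
(with-temp-buffer
  (set-buffer-multibyte t)
  (insert #x3042)                     ; one character, three bytes
  (let ((chars-before (buffer-size)))
    ;; The bytes stay the same, but each one is now a character.
    (set-buffer-multibyte nil)
    (list chars-before (buffer-size))))
     @result{} (1 3)
@end example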

@defun string-as-unibyte string
This function returns a string with the same bytes as @var{string} but
treating each byte as a character.  This means that the value may have
more characters than @var{string} has.  8-bit characters are an
exception.  Each of them is represented by two bytes in a multibyte
string, but is converted to one byte.

If @var{string} is already a unibyte string, then the value is
@var{string} itself.  Otherwise it is a newly created string, with no
text properties.
@end defun

@defun string-as-multibyte string
This function returns a string with the same bytes as @var{string} but
treating each multibyte sequence as one character.  This means that the
value may have fewer characters than @var{string} has.  If a byte
sequence in @var{string} is invalid as a multibyte representation,
each byte in the sequence is converted to the two-byte multibyte
representation of the corresponding 8-bit character.

If @var{string} is already a multibyte string, then the value is
@var{string} itself.  Otherwise it is a newly created string, with no
text properties.
@end defun
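
As an illustration of both functions (a sketch; the character
#x3042, whose three bytes are #xE3 #x81 #x82, is an arbitrary
example):

@example
;; Three bytes of one character become three unibyte characters.
(length (string-as-unibyte (char-to-string #x3042)))
     @result{} 3
;; An 8-bit character is the exception: two bytes become one.
(length (string-as-unibyte (char-to-string #x3FFFE9)))
     @result{} 1
;; The same three bytes, read as multibyte, form one character.
(aref (string-as-multibyte "\343\201\202") 0)
     @result{} 12354                 ; #x3042
@end example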

@node Character Codes
@section Character Codes
@cindex character codes

  The unibyte and multibyte text representations use different
character codes.  The valid character codes for unibyte representation
range from 0 to 255---the values that can fit in one byte.  The valid
character codes for multibyte representation range from 0 to 4194303
(#x3FFFFF).  In this code space, codes 0 through 127 are for
@acronym{ASCII} characters, codes 128 through 4194175 (#x3FFF7F) are
for non-@acronym{ASCII} characters, and codes 4194176 (#x3FFF80)
through 4194303 (#x3FFFFF) are for 8-bit characters.  Codes 0 through
1114111 (#x10FFFF) correspond to Unicode characters of the same
codes.
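
The division of the multibyte code space can be sketched with a small
helper.  The function @code{my-char-kind} is hypothetical, written
here only to illustrate the ranges above; it is not part of Emacs.

@example
(defun my-char-kind (code)
  "Return the kind of character CODE, or nil if CODE is not valid.
Hypothetical helper for illustration only."
  (cond ((not (and (integerp code) (<= 0 code #x3FFFFF))) nil)
        ((<= code 127) 'ascii)
        ((<= code #x3FFF7F) 'non-ascii)
        (t 'eight-bit)))

(my-char-kind ?a)
     @result{} ascii
(my-char-kind #x110000)              ; beyond the Unicode range
     @result{} non-ascii
(my-char-kind #x3FFF80)
     @result{} eight-bit
(my-char-kind #x400000)
     @result{} nil
@end example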

@defun characterp charcode
This returns @code{t} if @var{charcode} is a valid character, and
@code{nil} otherwise.

@example
(characterp 65)
     @result{} t
(characterp 4194303)
     @result{} t
(characterp 4194304)
     @result{} nil
@end example
@end defun