From: Kenichi Handa
Newsgroups: gmane.emacs.devel
Subject: Re: Emacs 23 character code space
Date: Mon, 03 Nov 2008 21:45:20 +0900
To: Kenichi Handa
Cc: eliz@gnu.org, emacs-devel@gnu.org
In-reply-to: (message from Kenichi Handa on Mon, 03 Nov 2008 10:34:09 +0900)

In article , Kenichi Handa writes:

> I'm now in Vietnam, and the Internet connection is very bad,
> so here's a very short reply.

I moved to another hotel, and the Internet connection is a
little bit better here. :-)

> In article , Eli Zaretskii writes:

> > This fragment from etc/NEWS:
> >   The character code space is now 0x0..0x3FFFFF with no gap.
> >   Characters of code 0x0..0x10FFFF are Unicode characters of the same code points.
> >   Characters of code 0x3FFF80..0x3FFFFF are raw 8-bit bytes.
> > seems to contradict itself: it says there's ``no gap'', but the codes
> > between 0x110000 and 0x3FFF7F do constitute a gap, don't they?

> Those are for character codes not unified with Unicode.

I tried to rewrite nonascii.texi to clear things up. I have
finished up to the "Character Codes" section, as attached.
What do you think about it?
---
Kenichi Handa
handa@ni.aist.go.jp

@c -*-texinfo-*-
@c This is part of the GNU Emacs Lisp Reference Manual.
@c Copyright (C) 1998, 1999, 2001, 2002, 2003, 2004,
@c   2005, 2006, 2007, 2008 Free Software Foundation, Inc.
@c See the file elisp.texi for copying conditions.
@setfilename ../../info/characters
@node Non-ASCII Characters, Searching and Matching, Text, Top
@chapter Non-@acronym{ASCII} Characters
@cindex multibyte characters
@cindex characters, multi-byte
@cindex non-@acronym{ASCII} characters

This chapter covers the special issues relating to
non-@acronym{ASCII} characters and how they are stored in strings and
buffers.

@menu
* Text Representations::       Unibyte and multibyte representations
* Converting Representations:: Converting unibyte to multibyte and vice versa.
* Selecting a Representation:: Treating a byte sequence as unibyte or multi.
* Character Codes::            How unibyte and multibyte relate to
                                 codes of individual characters.
* Character Sets::             The space of possible character codes
                                 is divided into various character sets.
* Chars and Bytes::            More information about multibyte encodings.
* Splitting Characters::       Converting a character to its byte sequence.
* Scanning Charsets::          Which character sets are used in a buffer?
* Translation of Characters::  Translation tables are used for conversion.
* Coding Systems::             Coding systems are conversions for saving files.
* Input Methods::              Input methods allow users to enter various
                                 non-ASCII characters without special keyboards.
* Locales::                    Interacting with the POSIX locale.
@end menu

@node Text Representations
@section Text Representations
@cindex text representations

Emacs has two @dfn{text representations}---two ways to represent text
in a string or buffer. These are called @dfn{unibyte} and
@dfn{multibyte}. Each string, and each buffer, uses one of these two
representations to store a sequence of Emacs characters.
Emacs classifies characters into three kinds: @acronym{ASCII}
characters, non-@acronym{ASCII} characters, and 8-bit characters.
8-bit characters correspond to the raw bytes 128 through 255.
@xref{Character Codes}, for details.

@cindex unibyte text
@cindex unibyte character
In unibyte representation, each character occupies one byte and
therefore the possible character codes range from 0 to 255. Codes 0
through 127 are @acronym{ASCII} characters; codes 128 through 255 are
8-bit characters. Non-@acronym{ASCII} characters cannot be stored in
unibyte text. We call a character in unibyte text a unibyte character.

@cindex leading code
@cindex multibyte text
@cindex multibyte character
In multibyte representation, a character may occupy more than one
byte, and as a result, the full range of Emacs character codes
(#x0..#x3FFFFF) can be stored. @acronym{ASCII} characters occupy one
byte, non-@acronym{ASCII} characters occupy two to five bytes (the
first byte is in the range #xC2 through #xF8, and the remaining bytes
are in the range #x80 through #xBF), and 8-bit characters occupy two
bytes (the first byte is #xC0 or #xC1, and the second byte is in the
range #x80 through #xBF). This representation is UTF-8 with
extensions for non-Unicode characters and 8-bit characters. A byte
sequence that does not fit the patterns above is guaranteed never to
appear in this representation.

In a buffer, the buffer-local value of the variable
@code{enable-multibyte-characters} specifies the representation used.
The representation for a string is determined and recorded in the
string when the string is constructed.

@defvar enable-multibyte-characters
This variable specifies the current buffer's text representation.
If it is non-@code{nil}, the buffer contains multibyte text; otherwise,
it contains unibyte text.

You cannot set this variable directly; instead, use the function
@code{set-buffer-multibyte} to change a buffer's representation.
@end defvar

@defvar default-enable-multibyte-characters
This variable's value is entirely equivalent to @code{(default-value
'enable-multibyte-characters)}, and setting this variable changes that
default value. Setting the local binding of
@code{enable-multibyte-characters} in a specific buffer is not allowed,
but changing the default value is supported, and it is a reasonable
thing to do, because it has no effect on existing buffers.

The @samp{--unibyte} command line option does its job by setting the
default value to @code{nil} early in startup.
@end defvar

@defun position-bytes position
Return the byte-position corresponding to buffer position
@var{position} in the current buffer. This is 1 at the start of the
buffer, and counts upward in bytes. If @var{position} is out of
range, the value is @code{nil}.
@end defun

@defun byte-to-position byte-position
Return the buffer position corresponding to byte-position
@var{byte-position} in the current buffer. If @var{byte-position} is
out of range, the value is @code{nil}. In a multibyte buffer, if
@var{byte-position} is not at a character boundary, the value is the
buffer position of the character that occupies @var{byte-position}.
@end defun

@defun multibyte-string-p string
Return @code{t} if @var{string} is a multibyte string.
@end defun

@defun string-bytes string
@cindex string, number of bytes
This function returns the number of bytes in @var{string}.
If @var{string} is a multibyte string, this can be greater than
@code{(length @var{string})}.
@end defun

@node Converting Representations
@section Converting Text Representations

Emacs can convert unibyte text to multibyte; it can also convert
multibyte text to unibyte provided that the multibyte text contains
only @acronym{ASCII} and 8-bit characters. In general these
conversions happen when inserting text into a buffer, or when putting
text from several strings together in one string.
You can also explicitly convert a string's contents to either
representation.

Emacs chooses the representation for a string based on the text that
it is constructed from. The general rule is to convert unibyte text to
multibyte text when combining it with other multibyte text, because the
multibyte representation is more general and can hold whatever
characters the unibyte text has.

When inserting text into a buffer, Emacs converts the text to the
buffer's representation, as specified by
@code{enable-multibyte-characters} in that buffer. In particular, when
you insert multibyte text into a unibyte buffer, Emacs converts the
text to unibyte, even though this conversion cannot in general preserve
all the characters that might be in the multibyte text. The other
natural alternative, to convert the buffer contents to multibyte, is
not acceptable because the buffer's representation is a choice made by
the user that cannot be overridden automatically.

Converting unibyte text to multibyte text leaves @acronym{ASCII}
characters unchanged, and converts 8-bit characters (codes 128 through
255) to the corresponding representation for multibyte text.

Converting multibyte text to unibyte is simpler: it discards all but
the low 8 bits of each character code. It effectively converts all
@acronym{ASCII} and 8-bit characters to the corresponding unibyte
representation, but loses information for non-@acronym{ASCII}
characters. Converting unibyte text to multibyte and back to unibyte
reproduces the original unibyte text.

The next two functions either return the argument @var{string}, or a
newly created string with no text properties.

@defun string-to-multibyte string
This function returns a multibyte string containing the same sequence
of characters as @var{string}. If @var{string} is a multibyte string,
it is returned unchanged.
@end defun

@defun string-to-unibyte string
This function returns a unibyte string containing the same sequence of
characters as @var{string}.
It signals an error if @var{string} contains a non-@acronym{ASCII}
character. If @var{string} is a unibyte string, it is returned
unchanged.
@end defun

@defun multibyte-char-to-unibyte char
This converts the multibyte character @var{char} to a unibyte
character. If @var{char} is a non-@acronym{ASCII} character, the
value is -1.
@end defun

@defun unibyte-char-to-multibyte char
This converts the unibyte character @var{char} to a multibyte
character.
@end defun

@node Selecting a Representation
@section Selecting a Representation

Sometimes it is useful to examine an existing buffer or string as
multibyte when it was unibyte, or vice versa.

@defun set-buffer-multibyte multibyte
Set the representation type of the current buffer. If @var{multibyte}
is non-@code{nil}, the buffer becomes multibyte. If @var{multibyte}
is @code{nil}, the buffer becomes unibyte.

This function leaves the buffer contents unchanged when viewed as a
sequence of bytes. As a consequence, it can change the contents
viewed as characters; a sequence of three bytes that is treated as
one character in multibyte representation will count as three
characters in unibyte representation. 8-bit characters are an
exception. They are represented by one byte in a unibyte buffer, but
when the buffer is set to multibyte, they are converted to two-byte
sequences, and vice versa.

This function sets @code{enable-multibyte-characters} to record which
representation is in use. It also adjusts various data in the buffer
(including overlays, text properties and markers) so that they cover
the same text as they did before.

You cannot use @code{set-buffer-multibyte} on an indirect buffer,
because indirect buffers always inherit the representation of the
base buffer.
@end defun

@defun string-as-unibyte string
This function returns a string with the same bytes as @var{string} but
treating each byte as a character. This means that the value may have
more characters than @var{string} has. 8-bit characters are an
exception.
Each of them is represented by two bytes in a multibyte string, but is
converted to one byte.

If @var{string} is already a unibyte string, then the value is
@var{string} itself. Otherwise it is a newly created string, with no
text properties.
@end defun

@defun string-as-multibyte string
This function returns a string with the same bytes as @var{string} but
treating each multibyte sequence as one character. This means that
the value may have fewer characters than @var{string} has. If a byte
sequence in @var{string} is invalid as a multibyte representation,
each byte in the sequence is converted to the two-byte multibyte
representation of an 8-bit character.

If @var{string} is already a multibyte string, then the value is
@var{string} itself. Otherwise it is a newly created string, with no
text properties.
@end defun

@node Character Codes
@section Character Codes
@cindex character codes

The unibyte and multibyte text representations use different
character codes. The valid character codes for unibyte representation
range from 0 to 255---the values that can fit in one byte. The valid
character codes for multibyte representation range from 0 to 4194303
(#x3FFFFF). In this code space, codes 0 through 127 are for
@acronym{ASCII} characters, codes 128 through 4194175 (#x3FFF7F) are
for non-@acronym{ASCII} characters, and codes 4194176 (#x3FFF80)
through 4194303 (#x3FFFFF) are for 8-bit characters. Codes 0 through
1114111 (#x10FFFF) correspond to Unicode characters of the same code
points.

@defun characterp charcode
This returns @code{t} if @var{charcode} is a valid character, and
@code{nil} otherwise.

@example
(characterp 65)
     @result{} t
(characterp 4194303)
     @result{} t
(characterp 4194304)
     @result{} nil
@end example
@end defun
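
The code ranges described in this section can be modeled outside
Emacs. The following is only a rough sketch in Python, not Emacs code.
The names classify, is_unicode and raw_byte are hypothetical, and
characterp here merely imitates the range check shown above. The
linear mapping from an 8-bit character code to its raw byte (code
minus #x3FFF00) is an assumption based on the two ranges having the
same size; the draft does not state it.

```python
# Sketch of the multibyte character code space described above.
# All names except the numeric limits are illustrative assumptions.

MAX_CHAR = 0x3FFFFF         # largest valid character code (4194303)
MAX_UNICODE = 0x10FFFF      # largest code that is a Unicode code point
FIRST_EIGHT_BIT = 0x3FFF80  # first code reserved for 8-bit characters


def characterp(code):
    """Imitate `characterp': true if CODE is within the code space."""
    return isinstance(code, int) and 0 <= code <= MAX_CHAR


def classify(code):
    """Return "ascii", "non-ascii" or "eight-bit" for a character code."""
    if not characterp(code):
        raise ValueError("not a character code: %r" % (code,))
    if code < 0x80:
        return "ascii"
    if code >= FIRST_EIGHT_BIT:
        return "eight-bit"
    return "non-ascii"


def is_unicode(code):
    """True if CODE is a Unicode character of the same code point."""
    return characterp(code) and code <= MAX_UNICODE


def raw_byte(code):
    """Return the raw byte (128..255) for an 8-bit character code.

    Assumes the 128 codes map onto the 128 bytes in order.
    """
    if classify(code) != "eight-bit":
        raise ValueError("not an 8-bit character: %r" % (code,))
    return code - 0x3FFF00


if __name__ == "__main__":
    for code in (65, 0x3B1, 0x110000, 0x3FFF80, 0x3FFFFF):
        print(hex(code), classify(code), is_unicode(code))
```

In this model, the codes from #x110000 through #x3FFF7F come out as
"non-ascii" but not Unicode, which is the range that the NEWS entry
left undescribed.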