From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: "Stephen J. Turnbull" <stephen@xemacs.org>
Newsgroups: gmane.emacs.devel,gmane.lisp.gcl.devel
Subject: Re: utf8 and emacs text/string multibyte representation
Date: Sun, 02 Nov 2014 03:32:15 +0900
Message-ID: <87mw8a22mo.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <jwvioj39hx0.fsf-monnier+emacs@gnu.org> <87wq7jxc7d.fsf@gnu.org>
	<87zjcfx985.fsf_-_@maguirefamily.org> <83mw8f0w08.fsf@gnu.org>
	<87oasu3m72.fsf@maguirefamily.org> <83bnou26is.fsf@gnu.org>
	<87bnotwsqn.fsf@maguirefamily.org> <83y4rxzgmm.fsf@gnu.org>
	<87lhnxo73l.fsf@maguirefamily.org> <83wq7hzf9t.fsf@gnu.org>
	<87h9ykazdr.fsf@maguirefamily.org> <831tpnz442.fsf@gnu.org>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
X-Trace: ger.gmane.org 1414866783 25044 80.91.229.3 (1 Nov 2014 18:33:03 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Sat, 1 Nov 2014 18:33:03 +0000 (UTC)
Cc: Camm Maguire <camm@maguirefamily.org>, gcl-devel@gnu.org,
	emacs-devel@gnu.org
To: Eli Zaretskii <eliz@gnu.org>
In-Reply-To: <831tpnz442.fsf@gnu.org>
X-Mailer: VM undefined under 21.5  (beta34) "kale" acf1c26e3019 XEmacs Lucid
	(x86_64-unknown-linux)
Xref: news.gmane.org gmane.emacs.devel:176219 gmane.lisp.gcl.devel:8816
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/176219>

Eli Zaretskii writes:

 > > Been discussing this elsewhere, and its come to my attention that not
 > > only do all unicode code-points not fit into UTF-16, but all unicode
 > > characters don't fit into unicode code-points :-).  Presumably this is
 > > why emacs expanded to 22bits?
 > 
 > Not sure what you mean here.  All Unicode characters do fit into the
 > Unicode codepoint space.  Emacs extends that codepoint space beyond 22
 > bits because it needs to support cultures which don't want unification
 > yet.

I suppose he means grapheme clusters, such as various accented
characters that can be built from a base character plus combining
characters but have no precomposed form in Unicode.  As you say,
that's not why Emacs extended the code space.
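
To make the distinction concrete, here is a small sketch in Python
(not Emacs Lisp; the helper name nfc_length is made up for this
illustration).  It assumes only the standard unicodedata module.  NFC
normalization folds a combining sequence into a single code point when
a precomposed form exists and leaves it alone otherwise:

```python
import unicodedata


def nfc_length(s: str) -> int:
    """Number of code points left after NFC normalization (hypothetical helper)."""
    return len(unicodedata.normalize("NFC", s))


# "e" + COMBINING ACUTE ACCENT: a precomposed form (U+00E9) exists,
# so NFC folds the pair into one code point.
e_acute = "e\u0301"

# "x" + COMBINING ACUTE ACCENT: Unicode has no precomposed form,
# so the user-perceived character stays two code points.
x_acute = "x\u0301"

print(nfc_length(e_acute))
print(nfc_length(x_acute))
```

Both strings display as one accented letter, but only the first can be
represented by a single code point.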

 > > Did you consider leaving aref, char-code and code-char alone and writing
 > > unicode functions on top of these, i.e. unicode-length!=length, as
 > > opposed to making aref itself do this translation under the hood,
 > > thereby violating the expectation of O(1) access, (which is certainly
 > > offered in other kinds of arrays, though it is questionable whether real
 > > users actually expect this for strings)?

Actually, Emacs originally allowed you to treat text (buffers and
strings) either as sequences of characters or as arrays of bytes, and
this was a real bug-breeder (and it is why XEmacs accepted the pain of
an incompatible change: separating the character type from the
integer type).

I'm not sure whether that dual view is still present in modern Emacs,
but at the very least its use is so rare today that I'm unaware of
any.

That's not what you asked, but it implies the answer "no, and you
shouldn't, either" to your question.  This is despite the fact that
yes, in many languages and applications users *do* expect O(1) access
to individual characters in text.
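
For illustration only, here is a rough Python sketch of why character
indexing into a variable-width encoding is not O(1), and why mixing
byte indices with character indices breeds bugs.  It assumes plain
UTF-8; the function names are invented for this example, and it models
neither Emacs's actual internal representation nor any position
caching a real implementation might use.

```python
def utf8_char_offset(buf: bytes, n: int) -> int:
    """Return the byte offset where the n-th character (0-based) starts.

    Hypothetical helper: scans from the start, so the cost grows with n.
    """
    count = -1
    for offset, byte in enumerate(buf):
        # Continuation bytes look like 0b10xxxxxx; any other byte
        # starts a new character.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == n:
                return offset
    raise IndexError("character index out of range")


def utf8_char_at(buf: bytes, n: int) -> str:
    """Decode the n-th character of a UTF-8 byte string."""
    start = utf8_char_offset(buf, n)
    end = start + 1
    # Extend over the continuation bytes belonging to this character.
    while end < len(buf) and buf[end] & 0xC0 == 0x80:
        end += 1
    return buf[start:end].decode("utf-8")


text = "naïve café"
buf = text.encode("utf-8")

# Character index 3 is "v", but byte index 3 lands in the middle of "ï".
print(utf8_char_at(buf, 3), utf8_char_offset(buf, 3), buf[3:4])
```

The character count and the byte count differ (10 versus 12 here), so
code that treats one kind of index as the other silently reads the
wrong position.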