From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Eli Zaretskii <eliz@gnu.org>
Newsgroups: gmane.emacs.devel,gmane.lisp.gcl.devel
Subject: Re: utf8 and emacs text/string multibyte representation
Date: Sat, 01 Nov 2014 11:01:33 +0200
Message-ID: <831tpnz442.fsf@gnu.org>
References: <jwvioj39hx0.fsf-monnier+emacs@gnu.org> <87wq7jxc7d.fsf@gnu.org>
	<87zjcfx985.fsf_-_@maguirefamily.org> <83mw8f0w08.fsf@gnu.org>
	<87oasu3m72.fsf@maguirefamily.org> <83bnou26is.fsf@gnu.org>
	<87bnotwsqn.fsf@maguirefamily.org> <83y4rxzgmm.fsf@gnu.org>
	<87lhnxo73l.fsf@maguirefamily.org> <83wq7hzf9t.fsf@gnu.org>
	<87h9ykazdr.fsf@maguirefamily.org>
Reply-To: Eli Zaretskii <eliz@gnu.org>
NNTP-Posting-Host: plane.gmane.org
X-Trace: ger.gmane.org 1414832585 16150 80.91.229.3 (1 Nov 2014 09:03:05 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Sat, 1 Nov 2014 09:03:05 +0000 (UTC)
Cc: gcl-devel@gnu.org, emacs-devel@gnu.org
To: Camm Maguire <camm@maguirefamily.org>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Sat Nov 01 10:02:58 2014
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by plane.gmane.org with esmtp (Exim 4.69)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1XkUaH-00043W-UH
	for ged-emacs-devel@m.gmane.org; Sat, 01 Nov 2014 10:02:58 +0100
Original-Received: from localhost ([::1]:49464 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1XkUaH-0000LU-IH
	for ged-emacs-devel@m.gmane.org; Sat, 01 Nov 2014 05:02:57 -0400
Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:48124)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <eliz@gnu.org>) id 1XkUZI-00082P-Ju
	for emacs-devel@gnu.org; Sat, 01 Nov 2014 05:02:02 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <eliz@gnu.org>) id 1XkUZD-000513-BL
	for emacs-devel@gnu.org; Sat, 01 Nov 2014 05:01:56 -0400
Original-Received: from mtaout20.012.net.il ([80.179.55.166]:58914)
	by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from <eliz@gnu.org>)
	id 1XkUZD-00050I-2p; Sat, 01 Nov 2014 05:01:51 -0400
Original-Received: from conversion-daemon.a-mtaout20.012.net.il by
	a-mtaout20.012.net.il (HyperSendmail v2007.08) id
	<0NEC00M00RJXS000@a-mtaout20.012.net.il>;
	Sat, 01 Nov 2014 11:01:49 +0200 (IST)
Original-Received: from HOME-C4E4A596F7 ([87.69.4.28]) by a-mtaout20.012.net.il
	(HyperSendmail v2007.08) with ESMTPA id
	<0NEC00M5HRR0CK70@a-mtaout20.012.net.il>;
	Sat, 01 Nov 2014 11:01:49 +0200 (IST)
In-reply-to: <87h9ykazdr.fsf@maguirefamily.org>
X-012-Sender: halo1@inter.net.il
X-detected-operating-system: by eggs.gnu.org: Solaris 10
X-Received-From: 80.179.55.166
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:176195 gmane.lisp.gcl.devel:8806
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/176195>

> From: Camm Maguire <camm@maguirefamily.org>
> Cc: emacs-devel@gnu.org,  gcl-devel@gnu.org
> Date: Fri, 31 Oct 2014 14:05:20 -0400
> 
> Been discussing this elsewhere, and its come to my attention that not
> only do all unicode code-points not fit into UTF-16, but all unicode
> characters don't fit into unicode code-points :-).  Presumably this is
> why emacs expanded to 22bits?

Not sure what you mean here.  All Unicode characters do fit into the
Unicode codepoint space.  Emacs extends that codepoint space beyond
Unicode's 21 bits, to 22 bits, because it needs to support cultures
which don't want unification yet.

> If this is indeed the case, all these encodings have the same problems
> though varying in degree, and UTF-8 is clearly the smallest and most
> ascii compatible.  The question then arises as to whether lisp
> characters, which by definition do offer random access in strings, need
> be the same as or close to unicode characters.  

In Emacs, they are the same, yes.  Anything else means considerable
complications, AFAIR.

Random access to strings on the Lisp level is implemented as a
function on the C level, which simply walks the UTF-8 representation
one character at a time.  UTF-8 makes it easy to determine the number
of bytes in a character from its first byte, so you compute that and
move that many bytes.
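
To illustrate the walk described above, here is a minimal sketch in
Python (not the Emacs C code).  The names utf8_char_len, char_to_byte
and char_at are made up for this example, and it handles only standard
UTF-8 sequences of 1 to 4 bytes; whatever Emacs's internal
representation adds on top of that is ignored.

```python
def utf8_char_len(lead: int) -> int:
    """Length in bytes of a UTF-8 sequence, judged from its lead byte alone."""
    if lead < 0x80:
        return 1  # 0xxxxxxx: ASCII
    if lead >> 5 == 0b110:
        return 2  # 110xxxxx
    if lead >> 4 == 0b1110:
        return 3  # 1110xxxx
    if lead >> 3 == 0b11110:
        return 4  # 11110xxx
    raise ValueError(f"not a UTF-8 lead byte: {lead:#x}")


def char_to_byte(data: bytes, char_index: int) -> int:
    """Byte offset of character number char_index, walking from the start."""
    pos = 0
    for _ in range(char_index):
        # Only the lead byte is inspected; continuation bytes are skipped.
        pos += utf8_char_len(data[pos])
    return pos


def char_at(data: bytes, char_index: int) -> str:
    """Character-indexed access built on top of the byte walk."""
    start = char_to_byte(data, char_index)
    end = start + utf8_char_len(data[start])
    return data[start:end].decode("utf-8")
```

The cost of char_at grows with char_index, which is the O(n) behavior
under discussion; the optimizations below are what make it cheap in
practice.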

Emacs includes optimizations for a popular use case when each
character is a single byte (as in pure ASCII strings).  It also
records the last string used in aref and the last character and the
corresponding byte accessed in that string.  So if the Lisp program
accesses several characters of the same string that are close to each
other, the 2nd and subsequent calls to aref are much cheaper, because
they start from a closer starting point.
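
A rough sketch of these two optimizations, again in Python rather than
the actual C code.  The class CachedString and its fields are
hypothetical; for simplicity the cache is kept per object instead of
one global "last string" record, and the walk only goes forward from
the cached position.  The steps counter exists only to make the saving
visible.

```python
class CachedString:
    """Character-indexed view over UTF-8 bytes with a last-position cache."""

    def __init__(self, text: str) -> None:
        self.data = text.encode("utf-8")
        self.nchars = len(text)
        self.last_char = 0  # character index of the last lookup
        self.last_byte = 0  # byte offset of that character
        self.steps = 0      # characters walked so far (illustration only)

    def _next(self, pos: int) -> int:
        """Byte offset of the character following the one at pos."""
        pos += 1
        # Continuation bytes have the form 10xxxxxx.
        while pos < len(self.data) and self.data[pos] & 0xC0 == 0x80:
            pos += 1
        return pos

    def char_to_byte(self, index: int) -> int:
        if not 0 <= index < self.nchars:
            raise IndexError(index)
        # Fast path: as many bytes as characters means every character
        # is a single byte, so the two indices coincide.
        if len(self.data) == self.nchars:
            return index
        # Resume from the cached position when it is not past the target.
        if index >= self.last_char:
            char, pos = self.last_char, self.last_byte
        else:
            char, pos = 0, 0
        while char < index:
            pos = self._next(pos)
            char += 1
            self.steps += 1
        self.last_char, self.last_byte = char, pos
        return pos

    def aref(self, index: int) -> str:
        start = self.char_to_byte(index)
        return self.data[start:self._next(start)].decode("utf-8")
```

With this, s.aref(1) followed by s.aref(2) on a non-ASCII string walks
one character per call instead of restarting from the beginning, and a
pure-ASCII string never walks at all.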

> Did you consider leaving aref, char-code and code-char alone and writing
> unicode functions on top of these, i.e. unicode-length!=length, as
> opposed to making aref itself do this translation under the hood,
> thereby violating the expectation of O(1) access, (which is certainly
> offered in other kinds of arrays, though it is questionable whether real
> users actually expect this for strings)?

What would be the benefit of having such byte-oriented aref?  Lisp
code needs to manipulate characters, not bytes.  Having byte-oriented
aref would just push the translation to characters to the Lisp level,
something no Lisp application wants, or should want, to do.

Internally, on the C level, Emacs does have access to individual
bytes, of course.  On that level, each string is indeed
byte-addressable at O(1) complexity.

> In doing so, one would then know that aref is random-access, and
> unicode-??? is sequential only.

As explained above, access to characters is not really sequential in
Emacs, except for the first access to a string that has not been
accessed yet.