From: Eli Zaretskii
To: Camm Maguire
Cc: gcl-devel@gnu.org, emacs-devel@gnu.org
Newsgroups: gmane.emacs.devel,gmane.lisp.gcl.devel
Subject: Re: utf8 and emacs text/string multibyte representation
Date: Sat, 01 Nov 2014 11:01:33 +0200

> From: Camm Maguire
> Cc: emacs-devel@gnu.org, gcl-devel@gnu.org
> Date: Fri, 31 Oct 2014 14:05:20 -0400
>
> Been discussing this elsewhere, and it's come to my attention that not
> only do all unicode code-points not fit into UTF-16, but all unicode
> characters don't fit into unicode code-points :-).  Presumably this is
> why emacs expanded to 22bits?

Not sure what you mean here.  All Unicode characters do fit into the
Unicode codepoint space.  Emacs extends that codepoint space to 22
bits because it needs to support cultures which don't want unification
yet.

> If this is indeed the case, all these encodings have the same problems
> though varying in degree, and UTF-8 is clearly the smallest and most
> ascii compatible.  The question then arises as to whether lisp
> characters, which by definition do offer random access in strings, need
> be the same as or close to unicode characters.

In Emacs, they are the same, yes.  Anything else means considerable
complications, AFAIR.

Random access to strings on the Lisp level is implemented as a
function on the C level, which simply walks the UTF-8 representation
one character at a time.  UTF-8 makes it easy to determine the number
of bytes in a character from its first byte, so you compute that and
move that many bytes.

Emacs includes optimizations for the popular use case where each
character is a single byte (as in pure ASCII strings).  It also
records the last string used in aref, together with the last character
position and the corresponding byte position accessed in that string.
So if the Lisp program accesses several characters of the same string
that are close to each other, the 2nd and subsequent calls to aref are
much cheaper, because they start from a closer starting point.

> Did you consider leaving aref, char-code and code-char alone and writing
> unicode functions on top of these, i.e. unicode-length!=length, as
> opposed to making aref itself do this translation under the hood,
> thereby violating the expectation of O(1) access, (which is certainly
> offered in other kinds of arrays, though it is questionable whether real
> users actually expect this for strings)?

What would be the benefit of having such a byte-oriented aref?  Lisp
code needs to manipulate characters, not bytes.  Having a byte-oriented
aref would just push the translation to characters up to the Lisp
level, something no Lisp application wants or should want to do.

Internally, on the C level, Emacs does have access to individual
bytes, of course.  On that level, each string is indeed
byte-addressable at O(1) complexity.

> In doing so, one would then know that aref is random-access, and
> unicode-??? is sequential only.

As explained above, the access to characters is not really sequential
in Emacs, except for the first access to a string that was not
accessed yet.
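
To make the walk and the cache described above concrete, here is a
minimal sketch in C.  It is not the Emacs source: the names
utf8_seq_len, pos_cache and char_to_byte are made up for illustration.
It assumes valid, standard UTF-8 (sequences of 1 to 4 bytes) and only
walks forward from the cached position.

```c
#include <assert.h>
#include <stddef.h>

/* Length in bytes of a UTF-8 sequence, judged from its lead byte.
   Assumption: standard UTF-8 only, and LEAD really is a lead byte.  */
static size_t
utf8_seq_len (unsigned char lead)
{
  if (lead < 0x80)
    return 1;                   /* 0xxxxxxx: ASCII */
  if ((lead & 0xE0) == 0xC0)
    return 2;                   /* 110xxxxx */
  if ((lead & 0xF0) == 0xE0)
    return 3;                   /* 1110xxxx */
  return 4;                     /* 11110xxx */
}

/* Hypothetical one-entry cache: the last string looked at, and the
   last character index / byte offset pair resolved in it.  */
struct pos_cache
{
  const unsigned char *str;
  size_t char_idx;
  size_t byte_idx;
};

/* Return the byte offset of character CHAR_IDX in S, a valid UTF-8
   string of NBYTES bytes holding NCHARS characters.  */
static size_t
char_to_byte (const unsigned char *s, size_t nbytes, size_t nchars,
              size_t char_idx, struct pos_cache *cache)
{
  size_t c = 0, b = 0;

  assert (char_idx <= nchars);

  /* Shortcut: every character is one byte, so the index is the offset.  */
  if (nbytes == nchars)
    return char_idx;

  /* Start from the cached position if it is in this string and not
     past the target; otherwise start from the beginning.  */
  if (cache->str == s && cache->char_idx <= char_idx)
    {
      c = cache->char_idx;
      b = cache->byte_idx;
    }

  /* Walk one character at a time, skipping each sequence whole.  */
  while (c < char_idx)
    {
      b += utf8_seq_len (s[b]);
      c++;
    }

  /* Remember where we ended up for the next nearby access.  */
  cache->str = s;
  cache->char_idx = c;
  cache->byte_idx = b;
  return b;
}
```

With a string such as "a", "é", "€", "b" (7 bytes, 4 characters),
asking for character 2 walks from the start, and a following request
for character 3 only has to skip one more sequence from the cached
position.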