From: Camm Maguire
Newsgroups: gmane.lisp.gcl.devel,gmane.emacs.devel
Subject: Re: utf8 and emacs text/string multibyte representation
Date: Fri, 31 Oct 2014 14:05:20 -0400
Message-ID: <87h9ykazdr.fsf@maguirefamily.org>
In-reply-to: <83wq7hzf9t.fsf@gnu.org> (Eli Zaretskii's message of "Thu, 30 Oct 2014 18:35:58 +0200")
To: Eli Zaretskii
Cc: gcl-devel@gnu.org, emacs-devel@gnu.org

Thanks so much!

Been discussing this elsewhere, and it's come to my attention that not only do not all Unicode code points fit into a single 16-bit UTF-16 unit, but not all Unicode characters fit into a single code point :-). Presumably this is why Emacs expanded to 22 bits? In any case, it makes clear what one correspondent said: that Unicode must be processed sequentially, so there is no real reason to struggle to get random O(1) access to Unicode characters.
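
To make the two observations above concrete, here is a minimal Python sketch (not Emacs or GCL code). The helper names utf16_units and count_code_points are made up for illustration. It shows a code point that needs two UTF-16 units, a user-visible character that needs two code points, and the difference between O(1) byte access and a sequential code-point count over UTF-8.

```python
import unicodedata


def utf16_units(s: str) -> int:
    """Number of 16-bit code units needed to encode s as UTF-16 (hypothetical helper)."""
    return len(s.encode("utf-16-le")) // 2


def count_code_points(data: bytes) -> int:
    """Count code points in well-formed UTF-8 by scanning every byte (hypothetical helper).

    Any byte that is not a continuation byte (bit pattern 10xxxxxx)
    starts a new code point, so the count needs one sequential pass.
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)


clef = "\U0001D11E"   # MUSICAL SYMBOL G CLEF, a code point above U+FFFF
e_acute = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT: one visible character

# One code point, but two 16-bit units (a surrogate pair) in UTF-16.
print(len(clef), utf16_units(clef))

# One user-visible character, but two code points; the second is a combining mark.
print(len(e_acute), unicodedata.combining(e_acute[1]) != 0)

text = "a" + clef + e_acute
raw = text.encode("utf-8")

# Indexing the byte array is O(1), but it returns a byte, not a character.
print(hex(raw[1]))

# The byte length and the code-point count differ; the latter needs a scan.
print(len(raw), count_code_points(raw))
```

Even the code-point count is not a count of user-visible characters here: the last two code points display as a single accented letter, and finding those boundaries requires yet another sequential pass.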
If this is indeed the case, all these encodings have the same problems, though varying in degree, and UTF-8 is clearly the smallest and most ASCII-compatible.

The question then arises as to whether Lisp characters, which by definition do offer random access in strings, need be the same as or close to Unicode characters. Did you consider leaving aref, char-code and code-char alone and writing Unicode functions on top of these, i.e. unicode-length != length, as opposed to making aref itself do this translation under the hood, thereby violating the expectation of O(1) access (which is certainly offered in other kinds of arrays, though it is questionable whether real users actually expect this for strings)? In doing so, one would then know that aref is random-access, and unicode-??? is sequential only.

Take care,

Eli Zaretskii writes:

>> From: Camm Maguire
>> Cc: emacs-devel@gnu.org, gcl-devel@gnu.org
>> Date: Thu, 30 Oct 2014 12:27:58 -0400
>>
>> > I'm not sure what you mean by a "boxed character".  A character in
>> > Emacs is just an int.
>> >
>>
>> Then how do you distinguish integers from characters at the lisp level?
>
> We don't -- except that a valid character's value must fit the Unicode
> range.
>
> There's no character data type in Emacs.  (XEmacs does have it.)

-- 
Camm Maguire camm@maguirefamily.org
==========================================================================
"The earth is but one country, and mankind its citizens." -- Baha'u'llah