From: David Kastrup
Newsgroups: gmane.emacs.devel,gmane.lisp.gcl.devel
Subject: Re: utf8 and emacs text/string multibyte representation
Date: Sat, 01 Nov 2014 19:41:22 +0100
Message-ID: <87y4ru227h.fsf@fencepost.gnu.org>
To: emacs-devel@gnu.org
Cc: gcl-devel@gnu.org

"Stephen J. Turnbull" writes:

> Eli Zaretskii writes:
>
>  > > Been discussing this elsewhere, and its come to my attention that not
>  > > only do all unicode code-points not fit into UTF-16, but all unicode
>  > > characters don't fit into unicode code-points :-).  Presumably this is
>  > > why emacs expanded to 22bits?
>  >
>  > Not sure what you mean here.  All Unicode characters do fit into the
>  > Unicode codepoint space.  Emacs extends that codepoint space beyond 22
>  > bits because it needs to support cultures which don't want unification
>  > yet.
>
> I suppose he means grapheme complexes, such as various accented
> characters that can be constructed from composing characters but do
> not have precomposed forms in Unicode.  As you say, that's not why
> Emacs extended the code space.
>
>  > > Did you consider leaving aref, char-code and code-char alone and writing
>  > > unicode functions on top of these, i.e. unicode-length!=length, as
>  > > opposed to making aref itself do this translation under the hood,
>  > > thereby violating the expectation of O(1) access, (which is certainly
>  > > offered in other kinds of arrays, though it is questionable whether real
>  > > users actually expect this for strings)?
>
> Actually, originally Emacs allowed you to treat text (buffers and
> strings) either as sequences of characters or arrays of bytes, and
> this was a real bug-breeder (and why XEmacs chose the pain of the
> incompatible separation of integer type from character type).
>
> I'm not sure if the feature is present in modern Emacs, but at the
> very least the usage is so rare today that I'm unaware of any.

string-as-unibyte and string-as-multibyte most certainly are available
for going from one to the other.  But the commands working on either
unibyte or multibyte strings are the same.  Similar for buffers.

I have no idea whether this is a problem vector for creating
inconsistent multibyte content.  I could imagine it to be, but so could
be user-created CCL programs for code conversion.

-- 
David Kastrup