From: Eli Zaretskii
To: Camm Maguire
Cc: gcl-devel@gnu.org, emacs-devel@gnu.org
Newsgroups: gmane.emacs.devel,gmane.lisp.gcl.devel
Subject: Re: utf8 and emacs text/string multibyte representation
Date: Wed, 29 Oct 2014 16:51:35 +0200
Message-ID: <83mw8f0w08.fsf@gnu.org>
In-reply-to: <87zjcfx985.fsf_-_@maguirefamily.org>

> From: Camm Maguire
> Date: Wed, 29 Oct 2014 10:04:58 -0400
>
> Greetings!  I've recently been considering supporting unicode in gcl by
> representing strings internally in utf8.  It appears that emacs does the
> same or similar.

If you haven't already, you can find some basic description of what
Emacs does in the node "Text Representations" of the ELisp manual.

> Apart from the obvious memory footprint benefits, I'd
> like to ask what other advantages/disadvantages have been discovered.

You have basically said it yourself: memory footprint vs
addressability.  If you want to discuss this in more detail, I suggest
asking more specific questions about the specific aspects that bother
you.

> A cached internal pointer storing the last referenced codepoint
> offset makes access essentially O(1).

We indeed maintain a cache for byte-to-character and character-to-byte
conversions.

> Yet setting string elements can trigger reallocations/memmove
> operations.

Emacs, as every editor, needs to handle this efficiently anyway,
because editing operations rarely leave the buffer size unchanged.  So
Emacs uses a gap to minimize reallocations.
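
The gap idea can be sketched roughly as follows.  This is a minimal
Python illustration under stated assumptions, not Emacs's actual
buffer code: the class name GapBuffer, its method names, and the
growth policy are all hypothetical.  The point it shows is that
inserting or deleting near the gap only shifts the bytes between the
old and new gap positions, and memory is reallocated only when the
gap runs out.

```python
class GapBuffer:
    """Toy gap buffer over bytes (hypothetical names, illustration only)."""

    def __init__(self, text=b"", gap_size=16):
        # Layout: [text before gap][gap][text after gap]
        self.buf = bytearray(text) + bytearray(gap_size)
        self.gap_start = len(text)
        self.gap_end = len(self.buf)

    def __len__(self):
        # Number of text bytes, excluding the gap.
        return len(self.buf) - (self.gap_end - self.gap_start)

    def _move_gap(self, pos):
        # Slide the gap so that it starts at text position pos.
        if pos < self.gap_start:
            n = self.gap_start - pos
            self.buf[self.gap_end - n:self.gap_end] = self.buf[pos:self.gap_start]
            self.gap_start -= n
            self.gap_end -= n
        elif pos > self.gap_start:
            n = pos - self.gap_start
            self.buf[self.gap_start:self.gap_start + n] = \
                self.buf[self.gap_end:self.gap_end + n]
            self.gap_start += n
            self.gap_end += n

    def _grow_gap(self, needed):
        # Reallocate only when the gap is too small for the insertion.
        if self.gap_end - self.gap_start >= needed:
            return
        extra = max(needed, len(self.buf))  # assumed growth policy
        self.buf[self.gap_end:self.gap_end] = bytearray(extra)
        self.gap_end += extra

    def insert(self, pos, data):
        if not 0 <= pos <= len(self):
            raise IndexError("insert position out of range")
        self._move_gap(pos)
        self._grow_gap(len(data))
        self.buf[self.gap_start:self.gap_start + len(data)] = data
        self.gap_start += len(data)

    def delete(self, pos, count):
        if not (0 <= pos and 0 <= count and pos + count <= len(self)):
            raise IndexError("delete range out of range")
        self._move_gap(pos)
        self.gap_end += count  # the deleted bytes simply join the gap

    def contents(self):
        return bytes(self.buf[:self.gap_start] + self.buf[self.gap_end:])
```

For example, GapBuffer(b"hello world").insert(5, b",") moves six bytes
into the far end of the gap and writes one byte, with no reallocation.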
> While these can be aggregated over the setting of multiple elements,
> operations like nreverse look ridiculous if left in terms of calls
> to aref and aset.

nreverse applied to a string is a rarity, IME.
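
Should the operation be needed anyway, one possible way to avoid
per-element aref/aset calls is to work on the bytes directly.  The
sketch below is a Python illustration of a common byte-level
technique, not something Emacs or GCL is known to do: the function
name nreverse_utf8 is hypothetical, the input is assumed to be valid
standard UTF-8 in a mutable byte array, and it reverses code points,
not grapheme clusters.

```python
def nreverse_utf8(buf):
    """Reverse the code points of valid UTF-8 stored in a bytearray, in place."""
    i, n = 0, len(buf)
    while i < n:
        lead = buf[i]
        # Sequence length from the lead byte (input assumed valid UTF-8).
        if lead < 0x80:
            width = 1
        elif lead >= 0xF0:
            width = 4
        elif lead >= 0xE0:
            width = 3
        else:
            width = 2
        # First pass: reverse the bytes inside each character.
        buf[i:i + width] = buf[i:i + width][::-1]
        i += width
    # Second pass: reversing the whole buffer restores the byte order
    # within each character while reversing the order of the characters.
    buf.reverse()
    return buf
```

Both passes are linear in the byte length and never change the size of
the buffer, so no reallocation or memmove of the tail is involved.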