From: Raymond Toy
Newsgroups: gmane.lisp.gcl.devel,gmane.emacs.devel
Subject: Re: utf8 and emacs text/string multibyte representation
Date: Wed, 29 Oct 2014 08:56:55 -0700
References: <87wq7jxc7d.fsf@gnu.org> <87zjcfx985.fsf_-_@maguirefamily.org>
To: gcl-devel@gnu.org
Cc: emacs-devel@gnu.org

>>>>> "Camm" == Camm Maguire writes:

    Camm> Greetings!  I've recently been considering supporting unicode in gcl by
    Camm> representing strings internally in utf8.  It appears that emacs does the
    Camm> same or similar.  Apart from the obvious memory footprint benefits, I'd
    Camm> like to ask what other advantages/disadvantages have been discovered.

    Camm> Much of the utf8 literature emphasizes that most algorithms can proceed
    Camm> conventionally in byte-wise fashion, including lexicographical ordering
    Camm> comparisons, given that almost all jobs are sequential, at least
    Camm> initially.  A cached internal pointer storing the last referenced
    Camm> codepoint offset makes access essentially O(1).  Yet setting string
    Camm> elements can trigger reallocations/memmove operations.  While these can
    Camm> be aggregated over the setting of multiple elements, operations like
    Camm> nreverse look ridiculous if left in terms of calls to aref and aset.

    Camm> Thoughts, advice and experiences most appreciated.

Have you looked at what other Lisp implementations do?  AFAIK, none use
utf-8.  CCL and clisp use utf-32, cmucl and allegro use utf-16, sbcl and
ecl(?) have two string types: 8-bit base-string and 32-bit strings.

As a one-man operation (unfortunately), I'd go with the easiest one to
get right and follow either ccl or cmucl.  The rest of the support for
unicode can be added with libraries like cl-unicode and/or babel, if
need be.

--
Ray
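
To make the aset trade-off above concrete, here is a rough Python sketch. It is not taken from gcl, emacs, or any of the Lisps mentioned, and the function names (utf8_len, set_char_utf8, set_char_utf32) are made up for illustration. It assumes a string is held as a raw byte buffer and shows that replacing one code point in a utf-8 buffer may resize it, while a utf-32 buffer is updated in place.

```python
def utf8_len(lead: int) -> int:
    """Length in bytes of a UTF-8 sequence, judged from its lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")


def set_char_utf8(buf: bytearray, index: int, ch: str) -> None:
    """Replace the code point at position `index` in a UTF-8 buffer.

    Finding the byte offset is a linear scan (no cached pointer here),
    and the slice assignment may grow or shrink the buffer, which is
    the reallocation/memmove cost mentioned in the quoted text.
    """
    pos = 0
    for _ in range(index):
        pos += utf8_len(buf[pos])
    old_len = utf8_len(buf[pos])
    buf[pos:pos + old_len] = ch.encode("utf-8")


def set_char_utf32(buf: bytearray, index: int, ch: str) -> None:
    """Replace the code point at position `index` in a UTF-32 buffer.

    Every code point is exactly 4 bytes, so the offset is 4 * index
    and the buffer never changes size.
    """
    buf[4 * index:4 * index + 4] = ch.encode("utf-32-le")


narrow = bytearray("abc".encode("utf-8"))
wide = bytearray("abc".encode("utf-32-le"))

set_char_utf8(narrow, 1, "\u00e9")   # 1-byte 'b' becomes 2-byte e-acute
set_char_utf32(wide, 1, "\u00e9")    # same edit, fixed-width slot

print(len(narrow), narrow.decode("utf-8"))
print(len(wide), wide.decode("utf-32-le"))
```

The fixed-width version has no scan and no resize to get wrong, which is one way to read the "easiest one to get right" argument; the utf-8 version pays for its smaller footprint with the bookkeeping shown in set_char_utf8.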