From: Camm Maguire
Newsgroups: gmane.emacs.devel,gmane.lisp.gcl.devel
Subject: Re: utf8 and emacs text/string multibyte representation
Date: Thu, 30 Oct 2014 10:13:20 -0400
Message-ID: <87bnotwsqn.fsf@maguirefamily.org>
In-reply-to: <83bnou26is.fsf@gnu.org> (Eli Zaretskii's message of "Wed, 29 Oct 2014 18:19:07 +0200")
To: Eli Zaretskii
Cc: gcl-devel@gnu.org, emacs-devel@gnu.org

Greetings!

Eli Zaretskii writes:

>> From: Camm Maguire
>> Cc: emacs-devel@gnu.org, gcl-devel@gnu.org
>> Date: Wed, 29 Oct 2014 11:55:13 -0400
>>
>> Does every string access in emacs proceed through the utf8 decoder?
>
> If you need to look at the character, yes.  E.g., if you need some
> property of the character, you need to index the appropriate table by
> that character's codepoint.  But in most operations that is not
> needed.  You just need to recognize several specific characters, like
> the null character, the slash, etc., most of which are ASCII.
>

Do you allocate a fresh boxed character on each aref, or output an
integer referring to a fixed ~2^22 sized table?
Do you maintain such a table in core?

>> >> A cached internal pointer storing the last referenced codepoint
>> >> offset makes access essentially O(1).
>> >
>> > We indeed maintain a cache for byte-to-character and character-to-byte
>> > conversions.
>>
>> How big is this cache?
>
> Its size is dynamic, and depends on how frequently the conversion is
> needed in places that are far away.  The cache stores byte-to-char
> correspondence in places that are far away, and Emacs uses binary
> search in between them.
>

How far is 'far away'?

If you had this to do all over again, would you still opt for the
multibyte?  While you have buffers to consider too, which probably
relate to strings, it seems to me that the dominant costs are always
memory allocation/gc related, making the memory footprint important but
not at the expense of allocating characters, and that the most frequent
operations are removals/pattern substitutions, which can proceed
bytewise with the same gc overhead.

GCL also supports regular expressions -- how is this modified for utf-8?

Take care,
-- 
Camm Maguire                                    camm@maguirefamily.org
==========================================================================
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah