From: Eli Zaretskii
Newsgroups: gmane.emacs.devel,gmane.lisp.gcl.devel
Subject: Re: utf8 and emacs text/string multibyte representation
Date: Thu, 30 Oct 2014 18:06:41 +0200
Message-ID: <83y4rxzgmm.fsf@gnu.org>
In-reply-to: <87bnotwsqn.fsf@maguirefamily.org>
To: Camm Maguire
Cc: gcl-devel@gnu.org, emacs-devel@gnu.org

> From: Camm Maguire
> Cc: emacs-devel@gnu.org, gcl-devel@gnu.org
> Date: Thu, 30 Oct 2014 10:13:20 -0400
>
> >> Does every string access in emacs proceed through the utf8 decoder?
> >
> > If you need to look at the character, yes.  E.g., if you need some
> > property of the character, you need to index the appropriate table by
> > that character's codepoint.  But in most operations that is not
> > needed.  You just need to recognize several specific characters, like
> > the null character, the slash, etc., most of which are ASCII.
> >
>
> Do you allocate a fresh boxed character on each aref, or output an
> integer referring to a fixed ~2^22 sized table?

I'm not sure what you mean by a "boxed character".  A character in
Emacs is just an int.

> Do you maintain such a table in core?

We have a lot of tables indexed by characters.  Their implementation
is memory efficient: it can store identical values for a range of
characters, and also store the default value with minimal overhead.

> >> > We indeed maintain a cache for byte-to-character and character-to-byte
> >> > conversions.
> >>
> >> How big is this cache?
> >
> > Its size is dynamic, and depends on how frequently the conversion is
> > needed in places that are far away.  The cache stores byte-to-char
> > correspondence in places that are far away, and Emacs uses binary
> > search in between them.
> >
>
> How far is 'far away'?

The current heuristic value is 5000 characters.

> If you had this to do all over again, would you still opt for the
> multibyte?

Yes, I think so.  I know nobody ever suggested switching.

> While you have buffers to consider too, which probably relate to
> strings, it seems to me that the dominant costs are always memory
> allocation/gc related, making the memory footprint important but not at
> the expense of allocating characters, and that the most frequent
> operations are removals/pattern substitutions, which can proceed
> bytewise with the same gc overhead.

We don't allocate characters, they are just integers.  As for strings,
Emacs allocates small strings specially, to minimize overhead.  And of
course, there's GC that takes care of freeing memory.

> GCL also supports regular expressions -- how is this modified for utf-8?

We use GNU regexp, slightly modified for Emacs.  I suggest taking a
look at the source.
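
For readers who want something concrete, the char-to-byte position
cache described above can be sketched roughly as follows.  This is a
minimal illustration in Python, not the Emacs code: the class name
Utf8Text, its fields, and the policy of adding a mark only after a long
walk are all assumptions made for the example.  Only the 5000-character
spacing comes from the message.  The sketch finds the nearest cached
mark with a binary search and then walks forward over the UTF-8 bytes;
it does not claim to match how Emacs searches between cached positions.

```python
import bisect

CACHE_STRIDE = 5000  # heuristic spacing, in characters, quoted in the message


class Utf8Text:
    """Hypothetical example: UTF-8 bytes plus a sparse char->byte cache."""

    def __init__(self, text, stride=CACHE_STRIDE):
        self.data = text.encode("utf-8")
        self.stride = stride
        # Parallel sorted lists of known (character, byte) positions.
        self.char_marks = [0]
        self.byte_marks = [0]

    def char_to_byte(self, charpos):
        """Return the byte offset at which character number CHARPOS starts."""
        if charpos < 0:
            raise IndexError("negative character position")
        # Binary search for the nearest cached mark at or before charpos.
        i = bisect.bisect_right(self.char_marks, charpos) - 1
        c, b = self.char_marks[i], self.byte_marks[i]
        start = c
        while c < charpos:
            if b >= len(self.data):
                raise IndexError("character position past end of text")
            b += 1
            # Skip UTF-8 continuation bytes, which look like 10xxxxxx.
            while b < len(self.data) and (self.data[b] & 0xC0) == 0x80:
                b += 1
            c += 1
        # Remember this position only if it was far from the mark we used.
        if charpos - start >= self.stride:
            self.char_marks.insert(i + 1, charpos)
            self.byte_marks.insert(i + 1, b)
        return b


text = Utf8Text("a\u00e9" * 6000)   # 12000 characters, 18000 bytes
near = text.char_to_byte(2)         # short walk, nothing is cached
far = text.char_to_byte(10000)      # long walk, a new mark is cached
```

After the second call, a lookup near character 10000 starts from the
cached mark instead of from the beginning of the text, which is the
point of keeping the marks sparse rather than storing one per character.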
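
The remark above about tables indexed by characters (one stored value
for a whole range, plus a cheap default) can be illustrated the same
way.  The sketch below is an assumption-laden stand-in, not the Emacs
char-table implementation, which the message does not describe: the
RangeTable class and its methods are hypothetical, and it uses a sorted
list of non-overlapping ranges purely because that is short to write.
It also shows the "a character is just an int" point, since lookups are
keyed by the integer code point.

```python
import bisect


class RangeTable:
    """Hypothetical range-compressed table keyed by integer code point."""

    def __init__(self, default=None):
        self.default = default  # returned for any code point not in a range
        self.starts = []        # sorted first code point of each range
        self.ends = []          # inclusive last code point of each range
        self.values = []        # one stored value per range

    def set_range(self, first, last, value):
        """Store VALUE once for every code point in FIRST..LAST."""
        if first > last:
            raise ValueError("empty range")
        i = bisect.bisect_left(self.starts, first)
        # Simplification for this sketch: ranges may not overlap.
        if i > 0 and self.ends[i - 1] >= first:
            raise ValueError("overlaps an earlier range")
        if i < len(self.starts) and self.starts[i] <= last:
            raise ValueError("overlaps a later range")
        self.starts.insert(i, first)
        self.ends.insert(i, last)
        self.values.insert(i, value)

    def get(self, codepoint):
        """Look up a code point, falling back to the default value."""
        i = bisect.bisect_right(self.starts, codepoint) - 1
        if i >= 0 and codepoint <= self.ends[i]:
            return self.values[i]
        return self.default


table = RangeTable(default="other")
table.set_range(0x41, 0x5A, "upper")    # A..Z share one entry
table.set_range(0x4E00, 0x9FFF, "han")  # about 21000 code points, one entry
kind = table.get(ord("Q"))              # a character is just an integer here
```

Two entries cover more than twenty thousand code points, and everything
else costs nothing beyond the single default value.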