From: Eli Zaretskii
Newsgroups: gmane.emacs.devel,gmane.lisp.gcl.devel
Subject: Re: utf8 and emacs text/string multibyte representation
Date: Wed, 29 Oct 2014 18:19:07 +0200
Message-ID: <83bnou26is.fsf@gnu.org>
In-reply-to: <87oasu3m72.fsf@maguirefamily.org>
To: Camm Maguire
Cc: gcl-devel@gnu.org, emacs-devel@gnu.org

> From: Camm Maguire
> Cc: emacs-devel@gnu.org, gcl-devel@gnu.org
> Date: Wed, 29 Oct 2014 11:55:13 -0400
>
> I thought there would be a little more on the upside, say some benefit
> from having the internal representation be the same as that used in many
> external representations, at least on linux

Yes, that too.  Emacs originally used a very different internal
encoding (ISO-2022 based), and the switch to UTF-8 based was due to
the above.  In general, having a Unicode basis works better when you
want to support various Unicode defined features, like the UCA etc.

> and perhaps some algorithm coalescing with straightforward byte-wise
> operations.

Not sure what you mean here, please elaborate.  In general, many
operations with UTF-8 strings can use the usual string library
functions, as you probably know very well.

> Does every string access in emacs proceed through the utf8 decoder?

If you need to look at the character, yes.  E.g., if you need some
property of the character, you need to index the appropriate table by
that character's codepoint.  But in most operations that is not needed.
You just need to recognize several specific characters, like the null
character, the slash, etc., most of which are ASCII.

> >> A cached internal pointer storing the last referenced codepoint
> >> offset makes access essentially O(1).
> >
> > We indeed maintain a cache for byte-to-character and character-to-byte
> > conversions.
>
> How big is this cache?

Its size is dynamic, and depends on how frequently the conversion is
needed in places that are far away.  The cache stores byte-to-char
correspondence in places that are far away, and Emacs uses binary
search in between them.

> >> Yet setting string elements can trigger reallocations/memmove
> >> operations.
> >
> > Emacs, as every editor, needs to handle this efficiently anyway,
> > because editing operations rarely leave the buffer size unchanged.  So
> > Emacs uses a gap to minimize reallocations.
> >
> > But no gap in strings, right (i.e. just buffers)?

Right.
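
To illustrate the kind of character-to-byte position cache discussed
above, here is a minimal sketch in Python.  It is not Emacs's actual
code (which is written in C), and the names PositionCache, min_gap and
char_to_byte are hypothetical.  It assumes valid UTF-8 input, keeps a
sorted list of known (character, byte) position pairs, uses binary
search to find the nearest known pair at or before the request, and
then scans forward over the UTF-8 bytes.  A new pair is remembered
only when the request was far from the pair it started from, so the
cache grows with how often distant positions are requested.

```python
import bisect


class PositionCache:
    """Hypothetical char -> byte position cache over valid UTF-8 bytes."""

    def __init__(self, data: bytes, min_gap: int = 64):
        self.data = data
        self.min_gap = min_gap  # assumed threshold for "far away", in characters
        self.chars = [0]        # sorted character positions of known checkpoints
        self.bytes = [0]        # byte position matching each checkpoint

    def char_to_byte(self, charpos: int) -> int:
        if charpos < 0:
            raise IndexError("negative character position")
        # Binary search for the last checkpoint at or before charpos.
        i = bisect.bisect_right(self.chars, charpos) - 1
        c, b = self.chars[i], self.bytes[i]
        # Scan forward one character at a time: step over the lead byte,
        # then over any continuation bytes (bit pattern 10xxxxxx).
        while c < charpos:
            if b >= len(self.data):
                raise IndexError("character position past end of text")
            b += 1
            while b < len(self.data) and (self.data[b] & 0xC0) == 0x80:
                b += 1
            c += 1
        # Remember this position only if it was far from the checkpoint used.
        if charpos - self.chars[i] >= self.min_gap:
            self.chars.insert(i + 1, charpos)
            self.bytes.insert(i + 1, b)
        return b


if __name__ == "__main__":
    text = "naïve café / 日本語 " * 100
    cache = PositionCache(text.encode("utf-8"))
    print(cache.char_to_byte(1000), len(cache.chars))
```

The forward scan only needs to tell continuation bytes from other
bytes; it never decodes a codepoint.  That is the same property that
lets a search for an ASCII character such as the slash work directly
on the bytes.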