From: David Kastrup
Newsgroups: gmane.lisp.guile.user
Subject: String internals sketch
Date: Fri, 10 Mar 2017 16:31:38 +0100
Message-ID: <87efy52lnp.fsf@fencepost.gnu.org>
To: guile-user@gnu.org

Hi,

I've been mulling a bit over enabling UTF-8 as an internal string
representation.  There are several interesting points to consider:

a) Guile already uses two different internal representations:
   basically Latin-1 (one byte per character) and UTF-32 (four bytes
   per character).  Adding more internal representations could be done
   using a number of tables indexed by the internal representation
   type, making string representations sort of a "plugin".

b) Scheme, at least in its older dialects, has several O(1)
   guarantees.  If the chosen internal representations can be
   configured with some fluid (as is done with port encodings),
   possibly as a list in priority order, one can retain an O(1)
   guarantee as the default behavior while allowing different
   mechanisms for different requirements.

c) Indexing is the most important operation to keep fast.  For a UTF-8
   internal representation, a lot is achieved if one caches both the
   last character index and its byte offset, preferably also the
   string length both in characters and in bytes.
   For each indexing operation, one can then first check this cached
   mid-string position and scan forwards or backwards from it to do
   the indexing (if string start or string end are closer, one can
   scan from there), assuming reasonably sanitized behavior.  That
   will make sequential processing close to O(1) per access.

d) A bad complication is write access to strings, for example with

    -- Scheme Procedure: string-map! proc s [start [end]]
    -- C Function: scm_string_map_x (proc, s, start, end)
        PROC is a char->char procedure, it is mapped over S.  The
        order in which the procedure is applied to the string
        elements is not specified.  The string S is modified
        in-place, the return value is not specified.

   The current string character can gain a longer or a shorter byte
   length in this process.  This requires being able to deal with
   insertion/deletion of bytes at the current position.  One way to do
   that is to have a _gap_ for the purpose of string-map! and similar
   functions and copy material from the end of the gap to its start
   while iterating.

Now this starts looking so much like the memory management of Emacs
buffers that it isn't funny.  It also is _the_ natural string
representation to use for things like with-output-to-string-buffer and
with-input-to-string-buffer.  Basically, one valid string
representation would coincide with a UTF-8 based string buffer.

This would seem like it would also give a boost to some Guile
operations and would offer possibilities for making "string
representation plugins" for special purposes.

This should offer a lot of leeway to gradually suck up the best parts
of things like Emacs' string representation and code conversion
facilities into a Guile kernel without being tied down by the current
Guile facilities and without having stuff like Emacs strings be black
boxes mostly inaccessible to Guile programming.

It should also allow bouncing stuff like sanitized UTF-8 through Guile
without incurring constant conversion costs.
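
To make points c) and d) a bit more concrete, here is a rough sketch in
Python rather than in Guile's C.  It is only an illustration under
stated assumptions: the buffer is assumed to hold valid UTF-8, the
mapping procedure is assumed to return exactly one character, and all
names in it (Utf8String, ref, map_in_place, GAP_BYTES) are made up for
this sketch and are not Guile API.

```python
GAP_BYTES = 16  # hypothetical gap size; must be >= 4 (longest UTF-8 sequence)


def is_continuation(byte):
    """True for UTF-8 continuation bytes (bit pattern 10xxxxxx)."""
    return (byte & 0xC0) == 0x80


def char_end(buf, start):
    """Byte offset just past the character whose lead byte is at start."""
    end = start + 1
    while end < len(buf) and is_continuation(buf[end]):
        end += 1
    return end


class Utf8String:
    """UTF-8 bytes plus a one-entry (character index, byte offset) cache."""

    def __init__(self, text):
        self.data = bytearray(text.encode("utf-8"))
        self.length = len(text)   # cached length in characters
        self.last_index = 0       # cached character index ...
        self.last_offset = 0      # ... and the byte offset of that character

    def __str__(self):
        return self.data.decode("utf-8")

    def _offset_of(self, index):
        if not 0 <= index < self.length:
            raise IndexError(index)
        # Start from whichever known position is closest: start, cache, or end.
        anchors = [(0, 0),
                   (self.last_index, self.last_offset),
                   (self.length, len(self.data))]
        pos, off = min(anchors, key=lambda a: abs(a[0] - index))
        while pos < index:        # scan forwards one character at a time
            off = char_end(self.data, off)
            pos += 1
        while pos > index:        # scan backwards to the previous lead byte
            off -= 1
            while is_continuation(self.data[off]):
                off -= 1
            pos -= 1
        self.last_index, self.last_offset = pos, off
        return off

    def ref(self, index):
        """Return the character at index, updating the cache."""
        off = self._offset_of(index)
        return self.data[off:char_end(self.data, off)].decode("utf-8")


def map_in_place(proc, s):
    """Map proc over s, writing at the gap start and reading from the gap end."""
    buf = s.data
    buf[0:0] = bytes(GAP_BYTES)            # open a gap at the front
    write, read = 0, GAP_BYTES             # the gap is buf[write:read]
    while read < len(buf):
        end = char_end(buf, read)
        new = proc(buf[read:end].decode("utf-8")).encode("utf-8")
        read = end                         # consumed bytes become part of the gap
        if write + len(new) > read:        # gap too small: widen it
            buf[read:read] = bytes(GAP_BYTES)
            read += GAP_BYTES
        buf[write:write + len(new)] = new  # same-size slice, overwrites the gap
        write += len(new)
    del buf[write:read]                    # close the gap again
    s.last_index = s.last_offset = 0       # byte offsets moved: reset the cache
```

Sequential calls such as s.ref(0), s.ref(1), s.ref(2) each scan at most
one character from the cached position, which is where the "close to
O(1)" behavior comes from.  map_in_place only has to move the tail of
the buffer when the gap runs out, not on every character that changes
its byte length.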

So this approach should provide a _large_ opportunity for the sake of
applications with profiles akin to Emacs or LilyPond.

-- 
David Kastrup