From: Andy Wingo
Newsgroups: gmane.lisp.guile.user
Subject: Re: String internals sketch
Date: Fri, 10 Mar 2017 17:08:31 +0100
Message-ID: <87wpbxru68.fsf@pobox.com>
References: <87efy52lnp.fsf@fencepost.gnu.org>
In-Reply-To: <87efy52lnp.fsf@fencepost.gnu.org> (David Kastrup's message of "Fri, 10 Mar 2017 16:31:38 +0100")
To: David Kastrup
Cc: guile-user@gnu.org

Hi :)

On Fri 10 Mar 2017 16:31, David Kastrup writes:

> a) Guile already uses two different internal representations: basically
> UCS-8 and UCS-32.
> Adding more internal representations could be done
> using a number of tables indexed by the internal representation type,
> making string representations sort of a "plugin".

I think we probably want to avoid this if we can.  We gain a number of
efficiencies if we can be concrete.  Of course there are
counterexamples in which specialization can help, like the 20-some
string kinds in V8, for example: cons strings, latin1 strings, utf-16
strings, external strings, slices, and the product of all of those; but
I am hesitant to take on this cost.

If we switched to UTF-8 strings, I would like to use it as our only
string representation.

Sure would be nice to have cons strings though!  (That would give O(1)
string-append.)

> b) Scheme, at least older dialects, have several O(1) guarantees.

R7RS seems to have relaxed this FWIW.  O(1) is great of course but there
are reasonable cases to be made for O(log N) being a good tradeoff if
you get other benefits.

> c) Indexing is the most important thing one wants to be fast.  For an
> utf-8 internal representation, a lot is achieved if one caches both last
> index and last byte offset, preferably also string length as index and
> byte length.

Consider threads though :/  Caches get a bit complicated there.

> d) a bad complication is write access to strings, for example with
>
>  -- Scheme Procedure: string-map! proc s [start [end]]
>  -- C Function: scm_string_map_x (proc, s, start, end)

TBH I wouldn't worry too much about this function in particular; you
could map characters into a vector and then write those characters back
to the string.  Most modern languages of course have read-only strings,
and destructive operations on strings are mostly used when filling
buffers.

That said, this point:

> The current string character can gain a longer or a shorter byte length
> in this process.

is especially gnarly in the threaded case; string updates are no longer
atomic.  One thread mutating a string might actually corrupt another thread.
Right now on x86, updates are entirely atomic; on other processors that
need barriers, the worst that could happen is that another thread could
fail to read an update.  We'd have to re-add the string write mutex,
which would be a bit sad :)

> So it should provide a _large_ opportunity for the sakes of applications
> with profiles akin to Emacs or LilyPond.

I'm sympathetic :)  Lots of details to get right (or wrong!) though.
WDYT about the threading aspects?

Andy
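
For reference, the index cache from point (c) could be sketched roughly
as below.  This is only an illustration under stated assumptions:
`struct u8str` and the `u8str_*` functions are hypothetical names, not
Guile's actual string layout or API, and the input is assumed to be
valid UTF-8.  The sketch remembers the last character index and its byte
offset, and walks forward or backward from there instead of rescanning
from the start.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical UTF-8 string with a one-entry index cache. */
struct u8str {
  const unsigned char *bytes; /* valid UTF-8 data (assumed) */
  size_t byte_len;            /* length in bytes */
  size_t char_len;            /* length in characters */
  size_t last_index;          /* cache: character index of last lookup */
  size_t last_offset;         /* cache: byte offset of that character */
};

/* UTF-8 continuation bytes have the bit pattern 10xxxxxx. */
static int is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

void u8str_init(struct u8str *s, const char *utf8) {
  s->bytes = (const unsigned char *) utf8;
  s->byte_len = strlen(utf8);
  s->char_len = 0;
  /* Every byte that is not a continuation byte starts a character. */
  for (size_t i = 0; i < s->byte_len; i++)
    if (!is_continuation(s->bytes[i]))
      s->char_len++;
  s->last_index = 0;
  s->last_offset = 0;
}

/* Byte offset of character INDEX, walking from the cached position. */
size_t u8str_offset(struct u8str *s, size_t index) {
  assert(index < s->char_len);
  size_t i = s->last_index, off = s->last_offset;
  if (index < i / 2) {        /* the string start is closer than the cache */
    i = 0;
    off = 0;
  }
  while (i < index) {         /* step forward one character at a time */
    off++;
    while (is_continuation(s->bytes[off]))
      off++;
    i++;
  }
  while (i > index) {         /* step backward one character at a time */
    off--;
    while (is_continuation(s->bytes[off]))
      off--;
    i--;
  }
  s->last_index = i;          /* note: a lookup writes to the string object */
  s->last_offset = off;
  return off;
}

/* Decode the code point at character INDEX. */
uint32_t u8str_ref(struct u8str *s, size_t index) {
  const unsigned char *p = s->bytes + u8str_offset(s, index);
  if (p[0] < 0x80)
    return p[0];
  if ((p[0] & 0xE0) == 0xC0)
    return ((p[0] & 0x1Fu) << 6) | (p[1] & 0x3Fu);
  if ((p[0] & 0xF0) == 0xE0)
    return ((p[0] & 0x0Fu) << 12) | ((p[1] & 0x3Fu) << 6) | (p[2] & 0x3Fu);
  return ((p[0] & 0x07u) << 18) | ((p[1] & 0x3Fu) << 12)
         | ((p[2] & 0x3Fu) << 6) | (p[3] & 0x3Fu);
}
```

Sequential access such as `u8str_ref(&s, i)` for increasing `i` costs
one short step per call.  The catch is the one raised in the reply to
(c): even a read updates `last_index` and `last_offset`, so two threads
reading the same string race on the cache unless those fields are
protected by a lock or kept per thread.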