From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED!not-for-mail From: David Kastrup Newsgroups: gmane.lisp.guile.user Subject: Re: String internals sketch Date: Fri, 10 Mar 2017 17:51:43 +0100 Message-ID: <87a88t2hy8.fsf@fencepost.gnu.org> References: <87efy52lnp.fsf@fencepost.gnu.org> <87wpbxru68.fsf@pobox.com> NNTP-Posting-Host: blaine.gmane.org Mime-Version: 1.0 Content-Type: text/plain X-Trace: blaine.gmane.org 1489164771 1116 195.159.176.226 (10 Mar 2017 16:52:51 GMT) X-Complaints-To: usenet@blaine.gmane.org NNTP-Posting-Date: Fri, 10 Mar 2017 16:52:51 +0000 (UTC) User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.0.50 (gnu/linux) Cc: guile-user@gnu.org To: Andy Wingo Original-X-From: guile-user-bounces+guile-user=m.gmane.org@gnu.org Fri Mar 10 17:52:46 2017 Return-path: Envelope-to: guile-user@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by blaine.gmane.org with esmtp (Exim 4.84_2) (envelope-from ) id 1cmNmU-0007w4-N2 for guile-user@m.gmane.org; Fri, 10 Mar 2017 17:52:42 +0100 Original-Received: from localhost ([::1]:39885 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1cmNma-0006d6-Q9 for guile-user@m.gmane.org; Fri, 10 Mar 2017 11:52:48 -0500 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:52769) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1cmNm5-0006bt-Ap for guile-user@gnu.org; Fri, 10 Mar 2017 11:52:18 -0500 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1cmNm1-000852-4A for guile-user@gnu.org; Fri, 10 Mar 2017 11:52:17 -0500 Original-Received: from fencepost.gnu.org ([2001:4830:134:3::e]:51373) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1cmNm1-00084g-0Y; Fri, 10 Mar 2017 11:52:13 -0500 Original-Received: from ip-109-47-195-199.web.vodafone.de ([109.47.195.199]:53341 helo=lola) by fencepost.gnu.org with esmtpsa (TLS1.2:RSA_AES_256_CBC_SHA1:256) (Exim 4.82) (envelope-from ) id 1cmNlw-0006KA-Er; Fri, 10 Mar 2017 11:52:12 -0500 In-Reply-To: <87wpbxru68.fsf@pobox.com> (Andy Wingo's message of "Fri, 10 Mar 2017 17:08:31 +0100") X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-Received-From: 2001:4830:134:3::e X-BeenThere: guile-user@gnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: General Guile related discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: guile-user-bounces+guile-user=m.gmane.org@gnu.org Original-Sender: "guile-user" Xref: news.gmane.org gmane.lisp.guile.user:13483 Archived-At: Andy Wingo writes: > Hi :) > > On Fri 10 Mar 2017 16:31, David Kastrup writes: > >> a) Guile already uses two different internal representations: basically >> UCS-8 and UCS-32. Adding more internal representations could be done >> using a number of tables indexed by the internal representation type, >> making string representations sort of a "plugin". > > I think we probably want to avoid this if we can. We gain a number of > efficiencies if we can be concrete. Well, the crucial question is where to hook the specialization for the best compromise between performance, versatilty and effort. > Of course there are counterexamples in which specialization can help, > like the 20-some string kinds in V8, for example: cons strings, latin1 > strings, utf-16 strings, external strings, slices, and the product of > all of those; but I am hesitant to take on this cost. If we switched > to UTF-8 strings, I would like to use it as our only string > representation. > > Sure would be nice to have cons strings though! (That would give O(1) > string-append.) String buffers should have O(1) string-append. >> b) Scheme, at least older dialects, have several O(1) guarantees. > > R7RS seems to have relaxed this FWIW. O(1) is great of course but there > are reasonable cases to be made for O(log N) being a good tradeoff if > you get other benefits. A guaranteed O(log N) is not exactly trivial to come by for utf-8 encoded strings either. >> c) Indexing is the most important thing one wants to be fast. For an >> utf-8 internal representation, a lot is achieved if one caches both >> last index and last byte offset, preferably also string length as >> index and byte length. > > Consider threads though :/ Caches get a bit complicated there. If there is actual access of several threads on the same string, a single-element cache is going to be mostly useless. Don't know about typical implementation costs and implementations of per-thread variables, so I can't make a useful proposal here. >> d) a bad complication is write access to strings, for example with >> >> -- Scheme Procedure: string-map! proc s [start [end]] >> -- C Function: scm_string_map_x (proc, s, start, end) > > TBH I wouldn't worry too much about this function in particular; It's certainly not an important function regarding its use frequency, but it's a good worst case study and is nicely covered by the gapped-string model. > That said, this point: > >> The current string character can gain a longer or a shorter byte >> length in this process. > > Is especially gnarly in the threaded case; string updates are no longer > atomic. One thread mutating a string might actually corrupt another > thread. Right now on x86, updates are entirely atomic; on other > processors that need barriers, the worst that could happen is that > another thread could fail to read an update. We'd have to re-add the > string write mutex, which would be a bit sad :) A gapped string does not have a lot of contention, but of course if two threads wanted to write in different places, there would need to be a way to either lock the gap until the end of the current operation or to support multiple gaps in progress. >> So it should provide a _large_ opportunity for the sakes of >> applications with profiles akin to Emacs or LilyPond. > > I'm sympathetic :) Lots of details to get right (or wrong!) though. Sure. Not all implementations need to get along without locks, by the way. I am not a fan of providing choice between several bad options instead of getting one version right, but "right" is so large a field here, particularly when considering embedding Guile into other systems, that it might be a worthwhile goal to at least have the "right for any purpose" goal met at the framework layer, and then try improving the implementations from there. > WDYT about the threading aspects? Good for headaches. But it might make sense diversifying the headaches into the several implementations. -- David Kastrup