From: David Kastrup <dak@gnu.org>
To: Andy Wingo <wingo@pobox.com>
Cc: guile-user@gnu.org
Subject: Re: String internals sketch
Date: Fri, 10 Mar 2017 17:51:43 +0100
Message-ID: <87a88t2hy8.fsf@fencepost.gnu.org>
In-Reply-To: <87wpbxru68.fsf@pobox.com> (Andy Wingo's message of "Fri, 10 Mar 2017 17:08:31 +0100")
Andy Wingo <wingo@pobox.com> writes:
> Hi :)
>
> On Fri 10 Mar 2017 16:31, David Kastrup <dak@gnu.org> writes:
>
>> a) Guile already uses two different internal representations: basically
>> Latin-1 and UCS-4. Adding more internal representations could be done
>> using a number of tables indexed by the internal representation type,
>> making string representations sort of a "plugin".
>
> I think we probably want to avoid this if we can. We gain a number of
> efficiencies if we can be concrete.
Well, the crucial question is where to hook the specialization for the
best compromise between performance, versatility, and effort.
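Roughly, I am thinking of something along these lines (a minimal C
sketch; none of the names here are actual Guile API, and the two
representations merely stand in for whatever Guile really stores):

  /* Hypothetical table-driven string layer: every representation
     registers its accessors, and the generic operations dispatch
     through the table on a per-string tag. */
  #include <stddef.h>
  #include <stdint.h>

  typedef uint32_t cpoint;                         /* one code point */

  typedef struct {
    cpoint (*ref)    (const void *buf, size_t i);       /* char at i */
    size_t (*length) (const void *buf, size_t nbytes);  /* chars in buf */
  } str_ops;

  static cpoint latin1_ref (const void *buf, size_t i)
  { return ((const uint8_t *) buf)[i]; }
  static size_t latin1_length (const void *buf, size_t nbytes)
  { (void) buf; return nbytes; }                   /* 1 byte per char */

  static cpoint ucs4_ref (const void *buf, size_t i)
  { return ((const uint32_t *) buf)[i]; }
  static size_t ucs4_length (const void *buf, size_t nbytes)
  { (void) buf; return nbytes / 4; }               /* 4 bytes per char */

  enum rep { REP_LATIN1, REP_UCS4, N_REPS };       /* later: REP_UTF8, ... */

  static const str_ops rep_table[N_REPS] = {
    [REP_LATIN1] = { latin1_ref, latin1_length },
    [REP_UCS4]   = { ucs4_ref,   ucs4_length   },
  };

  typedef struct { enum rep tag; void *buf; size_t nbytes; } mystr;

  static cpoint mystr_ref (const mystr *s, size_t i)
  { return rep_table[s->tag].ref (s->buf, i); }

Adding a representation then means filling in another table slot rather
than touching every string primitive.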
> Of course there are counterexamples in which specialization can help,
> like the 20-some string kinds in V8, for example: cons strings, latin1
> strings, utf-16 strings, external strings, slices, and the product of
> all of those; but I am hesitant to take on this cost. If we switched
> to UTF-8 strings, I would like to use it as our only string
> representation.
>
> Sure would be nice to have cons strings though! (That would give O(1)
> string-append.)
String buffers should have O(1) string-append.
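For illustration, a cons string in the V8 sense would be little more
than the following (hypothetical C sketch, not an existing Guile
structure):

  /* Hypothetical cons-string node: string-append allocates only this
     node, so the append itself is O(1); the real cost is deferred to
     later traversal or flattening. */
  #include <stddef.h>
  #include <stdlib.h>

  typedef struct str str;
  struct str {
    enum { STR_FLAT, STR_CONS } kind;
    size_t length;                            /* total character count */
    union {
      struct { const char *bytes; size_t nbytes; } flat;  /* contiguous */
      struct { str *left, *right; } cons;     /* lazy concatenation */
    } u;
  };

  static str *str_append (str *a, str *b)     /* O(1): one allocation */
  {
    str *s = malloc (sizeof *s);
    if (s != NULL)
      {
        s->kind = STR_CONS;
        s->length = a->length + b->length;
        s->u.cons.left = a;
        s->u.cons.right = b;
      }
    return s;
  }

Indexing into such a node has to walk the children, which is presumably
why V8 flattens these nodes lazily.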
>> b) Scheme, at least older dialects, have several O(1) guarantees.
>
> R7RS seems to have relaxed this FWIW. O(1) is great of course but there
> are reasonable cases to be made for O(log N) being a good tradeoff if
> you get other benefits.
A guaranteed O(log N) is not exactly trivial to come by for UTF-8
encoded strings either.
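To make the point concrete: getting that bound would mean maintaining
something like a balanced tree of UTF-8 chunks with per-node character
counts, updated on every edit. A hypothetical sketch of just the lookup
side:

  /* Hypothetical chunk tree over UTF-8 data: each internal node stores
     how many characters its left subtree holds, so looking up character
     index i is a descent of O(depth) -- but the structure has to be
     kept balanced and up to date on every edit, which is where the
     non-triviality lies. */
  #include <stddef.h>

  typedef struct u8node u8node;
  struct u8node {
    size_t left_chars;            /* characters in the left subtree */
    u8node *left, *right;         /* both NULL for a leaf */
    const unsigned char *bytes;   /* leaf only: a UTF-8 chunk */
    size_t nbytes;
  };

  /* Descend to the leaf containing character index *i; the caller then
     scans at most one chunk to find the byte offset. */
  static const u8node *u8_find_leaf (const u8node *n, size_t *i)
  {
    while (n->left != NULL)
      {
        if (*i < n->left_chars)
          n = n->left;
        else
          {
            *i -= n->left_chars;
            n = n->right;
          }
      }
    return n;
  }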
>> c) Indexing is the most important thing one wants to be fast. For a
>> UTF-8 internal representation, a lot is achieved if one caches both
>> last index and last byte offset, preferably also string length as
>> index and byte length.
>
> Consider threads though :/ Caches get a bit complicated there.
If several threads actually access the same string, a single-element
cache is going to be mostly useless anyway. I don't know the typical
costs and implementations of per-thread variables, so I can't make a
useful proposal here.
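For the single-threaded picture, the cache I have in mind is roughly
this (hypothetical sketch; a per-thread cache slot, e.g. via C11
_Thread_local, would be one way to keep the cache itself from becoming
a point of contention):

  /* Hypothetical index/offset cache for one UTF-8 string buffer: it
     remembers the byte offset of the most recently used character
     index, so nearby accesses only scan the difference.  Nothing here
     is thread-safe. */
  #include <stddef.h>

  typedef struct {
    size_t last_index;            /* character index of cached position */
    size_t last_byte;             /* byte offset of that character */
    size_t byte_length;           /* total bytes in the buffer */
  } u8cache;

  /* Map a character index to its byte offset.  UTF-8 continuation
     bytes look like 10xxxxxx and never start a character. */
  static size_t u8_index_to_byte (const unsigned char *buf, u8cache *c,
                                  size_t index)
  {
    size_t i = c->last_index, b = c->last_byte;
    if (index < i)                /* crude fallback: rescan from the start */
      i = b = 0;
    while (i < index)
      {
        b++;
        while (b < c->byte_length && (buf[b] & 0xC0) == 0x80)
          b++;                    /* skip continuation bytes */
        i++;
      }
    c->last_index = i;
    c->last_byte = b;
    return b;
  }

Sequential access patterns then cost O(1) amortized per character,
which is what string-for-each and friends mostly need.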
>> d) a bad complication is write access to strings, for example with
>>
>> -- Scheme Procedure: string-map! proc s [start [end]]
>> -- C Function: scm_string_map_x (proc, s, start, end)
>
> TBH I wouldn't worry too much about this function in particular;
It's certainly not an important function in terms of use frequency, but
it makes a good worst-case study and is nicely covered by the
gapped-string model.
> That said, this point:
>
>> The current string character can gain a longer or a shorter byte
>> length in this process.
>
> Is especially gnarly in the threaded case; string updates are no longer
> atomic. One thread mutating a string might actually corrupt another
> thread. Right now on x86, updates are entirely atomic; on other
> processors that need barriers, the worst that could happen is that
> another thread could fail to read an update. We'd have to re-add the
> string write mutex, which would be a bit sad :)
A gapped string does not have a lot of contention by itself, but of
course if two threads wanted to write in different places, there would
need to be a way either to lock the gap until the end of the current
operation or to support multiple gaps in progress.
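For concreteness, a minimal gap-buffer sketch (hypothetical, not Guile
code; any locking of the gap for concurrent writers is deliberately
left out):

  /* Minimal gap-buffer sketch: the bytes live in one allocation with a
     movable hole, so replacing a character by a longer or shorter
     UTF-8 sequence only grows or shrinks the gap instead of shifting
     the rest of the string on every change. */
  #include <stddef.h>
  #include <string.h>

  typedef struct {
    unsigned char *buf;
    size_t gap_start;             /* first byte of the gap */
    size_t gap_end;               /* first byte after the gap */
    size_t size;                  /* total bytes in buf */
  } gapstr;

  /* Move the gap so that it starts at text position pos. */
  static void gap_move (gapstr *g, size_t pos)
  {
    if (pos < g->gap_start)       /* shift intervening bytes to gap end */
      {
        size_t n = g->gap_start - pos;
        memmove (g->buf + g->gap_end - n, g->buf + pos, n);
        g->gap_start -= n;
        g->gap_end -= n;
      }
    else if (pos > g->gap_start)  /* shift intervening bytes to gap start */
      {
        size_t n = pos - g->gap_start;
        memmove (g->buf + g->gap_start, g->buf + g->gap_end, n);
        g->gap_start += n;
        g->gap_end += n;
      }
  }

  /* Replace old_len bytes at pos by new_len new bytes; the length
     difference is absorbed by the gap (caller ensures it fits). */
  static void gap_replace (gapstr *g, size_t pos, size_t old_len,
                           const unsigned char *new_bytes, size_t new_len)
  {
    gap_move (g, pos);
    g->gap_end += old_len;        /* drop the old bytes into the gap */
    memcpy (g->buf + g->gap_start, new_bytes, new_len);
    g->gap_start += new_len;      /* claim space from the gap */
  }

With this, string-map! only shuffles the bytes between the old and new
gap positions when a character changes byte length, instead of
reallocating and copying the whole string.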
>> So it should provide a _large_ opportunity for the sake of
>> applications with profiles akin to Emacs or LilyPond.
>
> I'm sympathetic :) Lots of details to get right (or wrong!) though.
Sure. Not all implementations need to get along without locks, by the
way. I am not a fan of offering a choice between several bad options
instead of getting one version right, but "right" covers so large a
field here, particularly when considering embedding Guile into other
systems, that it might be worthwhile to at least meet the "right for
any purpose" goal at the framework layer and then try improving the
implementations from there.
> WDYT about the threading aspects?
Good for headaches. But it might make sense to diversify the headaches
across the several implementations.
--
David Kastrup