unofficial mirror of guile-user@gnu.org 
 help / color / mirror / Atom feed
From: David Kastrup <dak@gnu.org>
To: Andy Wingo <wingo@pobox.com>
Cc: guile-user@gnu.org
Subject: Re: String internals sketch
Date: Fri, 10 Mar 2017 17:51:43 +0100	[thread overview]
Message-ID: <87a88t2hy8.fsf@fencepost.gnu.org> (raw)
In-Reply-To: <87wpbxru68.fsf@pobox.com> (Andy Wingo's message of "Fri, 10 Mar 2017 17:08:31 +0100")

Andy Wingo <wingo@pobox.com> writes:

> Hi :)
>
> On Fri 10 Mar 2017 16:31, David Kastrup <dak@gnu.org> writes:
>
>> a) Guile already uses two different internal representations: basically
>> UCS-8 and UCS-32.  Adding more internal representations could be done
>> using a number of tables indexed by the internal representation type,
>> making string representations sort of a "plugin".
>
> I think we probably want to avoid this if we can.  We gain a number of
> efficiencies if we can be concrete.

Well, the crucial question is where to hook the specialization for the
best compromise between performance, versatilty and effort.

> Of course there are counterexamples in which specialization can help,
> like the 20-some string kinds in V8, for example: cons strings, latin1
> strings, utf-16 strings, external strings, slices, and the product of
> all of those; but I am hesitant to take on this cost.  If we switched
> to UTF-8 strings, I would like to use it as our only string
> representation.
>
> Sure would be nice to have cons strings though!  (That would give O(1)
> string-append.)

String buffers should have O(1) string-append.

>> b) Scheme, at least older dialects, have several O(1) guarantees.
>
> R7RS seems to have relaxed this FWIW.  O(1) is great of course but there
> are reasonable cases to be made for O(log N) being a good tradeoff if
> you get other benefits.

A guaranteed O(log N) is not exactly trivial to come by for utf-8
encoded strings either.

>> c) Indexing is the most important thing one wants to be fast.  For an
>> utf-8 internal representation, a lot is achieved if one caches both
>> last index and last byte offset, preferably also string length as
>> index and byte length.
>
> Consider threads though :/ Caches get a bit complicated there.

If there is actual access of several threads on the same string, a
single-element cache is going to be mostly useless.  Don't know about
typical implementation costs and implementations of per-thread
variables, so I can't make a useful proposal here.

>> d) a bad complication is write access to strings, for example with
>>
>>  -- Scheme Procedure: string-map! proc s [start [end]]
>>  -- C Function: scm_string_map_x (proc, s, start, end)
>
> TBH I wouldn't worry too much about this function in particular;

It's certainly not an important function regarding its use frequency,
but it's a good worst case study and is nicely covered by the
gapped-string model.

> That said, this point:
>
>> The current string character can gain a longer or a shorter byte
>> length in this process.
>
> Is especially gnarly in the threaded case; string updates are no longer
> atomic.  One thread mutating a string might actually corrupt another
> thread.  Right now on x86, updates are entirely atomic; on other
> processors that need barriers, the worst that could happen is that
> another thread could fail to read an update.  We'd have to re-add the
> string write mutex, which would be a bit sad :)

A gapped string does not have a lot of contention, but of course if two
threads wanted to write in different places, there would need to be a
way to either lock the gap until the end of the current operation or to
support multiple gaps in progress.

>> So it should provide a _large_ opportunity for the sakes of
>> applications with profiles akin to Emacs or LilyPond.
>
> I'm sympathetic :) Lots of details to get right (or wrong!) though.

Sure.  Not all implementations need to get along without locks, by the
way.  I am not a fan of providing choice between several bad options
instead of getting one version right, but "right" is so large a field
here, particularly when considering embedding Guile into other systems,
that it might be a worthwhile goal to at least have the "right for any
purpose" goal met at the framework layer, and then try improving the
implementations from there.

> WDYT about the threading aspects?

Good for headaches.  But it might make sense diversifying the headaches
into the several implementations.

-- 
David Kastrup



  reply	other threads:[~2017-03-10 16:51 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-03-10 15:31 String internals sketch David Kastrup
2017-03-10 16:08 ` Andy Wingo
2017-03-10 16:51   ` David Kastrup [this message]
2017-03-15 15:46   ` David Kastrup
2017-03-11  0:30 ` Matt Wette
2017-03-11  6:44   ` David Kastrup

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/guile/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87a88t2hy8.fsf@fencepost.gnu.org \
    --to=dak@gnu.org \
    --cc=guile-user@gnu.org \
    --cc=wingo@pobox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).