unofficial mirror of guile-user@gnu.org 
 help / color / mirror / Atom feed
From: Andy Wingo <wingo@pobox.com>
To: David Kastrup <dak@gnu.org>
Cc: guile-user@gnu.org
Subject: Re: String internals sketch
Date: Fri, 10 Mar 2017 17:08:31 +0100	[thread overview]
Message-ID: <87wpbxru68.fsf@pobox.com> (raw)
In-Reply-To: <87efy52lnp.fsf@fencepost.gnu.org> (David Kastrup's message of "Fri, 10 Mar 2017 16:31:38 +0100")

Hi :)

On Fri 10 Mar 2017 16:31, David Kastrup <dak@gnu.org> writes:

> a) Guile already uses two different internal representations: basically
> UCS-8 and UCS-32.  Adding more internal representations could be done
> using a number of tables indexed by the internal representation type,
> making string representations sort of a "plugin".

I think we probably want to avoid this if we can.  We gain a number of
efficiencies if we can be concrete.

Of course there are counterexamples in which specialization can help,
like the 20-some string kinds in V8, for example: cons strings, latin1
strings, utf-16 strings, external strings, slices, and the product of
all of those; but I am hesitant to take on this cost.  If we switched to
UTF-8 strings, I would like to use it as our only string representation.

Sure would be nice to have cons strings though!  (That would give O(1)
string-append.)

> b) Scheme, at least older dialects, have several O(1) guarantees.

R7RS seems to have relaxed this FWIW.  O(1) is great of course but there
are reasonable cases to be made for O(log N) being a good tradeoff if
you get other benefits.

> c) Indexing is the most important thing one wants to be fast.  For an
> utf-8 internal representation, a lot is achieved if one caches both last
> index and last byte offset, preferably also string length as index and
> byte length.

Consider threads though :/ Caches get a bit complicated there.

> d) a bad complication is write access to strings, for example with
>
>  -- Scheme Procedure: string-map! proc s [start [end]]
>  -- C Function: scm_string_map_x (proc, s, start, end)

TBH I wouldn't worry too much about this function in particular; you
could map characters into to a vector and then write those characters
back to the string.  Most modern languages of course have read-only
strings, and destructive operations on strings are mostly used when
filling buffers.

That said, this point:

> The current string character can gain a longer or a shorter byte length
> in this process.

Is especially gnarly in the threaded case; string updates are no longer
atomic.  One thread mutating a string might actually corrupt another
thread.  Right now on x86, updates are entirely atomic; on other
processors that need barriers, the worst that could happen is that
another thread could fail to read an update.  We'd have to re-add the
string write mutex, which would be a bit sad :)

> So it should provide a _large_ opportunity for the sakes of applications
> with profiles akin to Emacs or LilyPond.

I'm sympathetic :) Lots of details to get right (or wrong!) though.
WDYT about the threading aspects?

Andy



  reply	other threads:[~2017-03-10 16:08 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-03-10 15:31 String internals sketch David Kastrup
2017-03-10 16:08 ` Andy Wingo [this message]
2017-03-10 16:51   ` David Kastrup
2017-03-15 15:46   ` David Kastrup
2017-03-11  0:30 ` Matt Wette
2017-03-11  6:44   ` David Kastrup

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/guile/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87wpbxru68.fsf@pobox.com \
    --to=wingo@pobox.com \
    --cc=dak@gnu.org \
    --cc=guile-user@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).