String internals sketch

unofficial mirror of guile-user@gnu.org 
 help / color / mirror / Atom feed

* String internals sketch
@ 2017-03-10 15:31 David Kastrup
  2017-03-10 16:08 ` Andy Wingo
  2017-03-11  0:30 ` Matt Wette
  0 siblings, 2 replies; 6+ messages in thread
From: David Kastrup @ 2017-03-10 15:31 UTC (permalink / raw)
  To: guile-user

Hi,

I've been mulling a bit over enabling UTF-8 as an internal string
representation.  There are several interesting points to consider:

a) Guile already uses two different internal representations: basically
UCS-8 and UCS-32.  Adding more internal representations could be done
using a number of tables indexed by the internal representation type,
making string representations sort of a "plugin".

b) Scheme, at least older dialects, have several O(1) guarantees.  If
the chosen internal representations can be configured with some fluid
(like it is done with port encodings), possibly as a list in priority
order, one can retain an O(1) guarantee as the default behavior while
allowing different mechanisms for different requirements.

c) Indexing is the most important thing one wants to be fast.  For an
utf-8 internal representation, a lot is achieved if one caches both last
index and last byte offset, preferably also string length as index and
byte length.  For each indexing operation, one can then first check the
last string-middle cache, and then scan forwards or backwards to do the
indexing (if string start or string end are closer, one can scan from
there), assuming reasonably sanitized behavior.  That will make
sequential processing close to O(1) behavior.

d) a bad complication is write access to strings, for example with

 -- Scheme Procedure: string-map! proc s [start [end]]
 -- C Function: scm_string_map_x (proc, s, start, end)
     PROC is a char->char procedure, it is mapped over S.  The order in
     which the procedure is applied to the string elements is not
     specified.  The string S is modified in-place, the return value is
     not specified.

The current string character can gain a longer or a shorter byte length
in this process.  This requires being able to deal with
insertion/deletion of bytes at the current position.  One way to do that
is to have a _gap_ for the purpose of string-map! and similar functions
and copy material from the end of the gap to its start while iterating.

Now this starts looking so much like the memory management Emacs buffers
that it isn't funny.  It also is _the_ natural string representation to
use for things like with-output-to-string-buffer and
with-input-to-string-buffer.

Basically, one valid string representation would coincide with an utf-8
based string buffer.

This would seem like it would also give a boost to some Guile operations
and would offer possibilities for making "string representation plugins"
for special purposes.  This should offer a lot of leeway to gradually
suck up the best parts of things like Emacs' string representation and
code conversion facilities into a Guile kernel without being tied down
by the current Guile facilities and without having stuff like Emacs
strings black boxes mostly inaccessible to Guile programming.

It should also allow to bounce stuff like sanitized utf-8 through Guile
without incurring constant conversion costs.

So it should provide a _large_ opportunity for the sakes of applications
with profiles akin to Emacs or LilyPond.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: String internals sketch
  2017-03-10 15:31 String internals sketch David Kastrup
@ 2017-03-10 16:08 ` Andy Wingo
  2017-03-10 16:51   ` David Kastrup
  2017-03-15 15:46   ` David Kastrup
  2017-03-11  0:30 ` Matt Wette
  1 sibling, 2 replies; 6+ messages in thread
From: Andy Wingo @ 2017-03-10 16:08 UTC (permalink / raw)
  To: David Kastrup; +Cc: guile-user

Hi :)

On Fri 10 Mar 2017 16:31, David Kastrup <dak@gnu.org> writes:

> a) Guile already uses two different internal representations: basically
> UCS-8 and UCS-32.  Adding more internal representations could be done
> using a number of tables indexed by the internal representation type,
> making string representations sort of a "plugin".

I think we probably want to avoid this if we can.  We gain a number of
efficiencies if we can be concrete.

Of course there are counterexamples in which specialization can help,
like the 20-some string kinds in V8, for example: cons strings, latin1
strings, utf-16 strings, external strings, slices, and the product of
all of those; but I am hesitant to take on this cost.  If we switched to
UTF-8 strings, I would like to use it as our only string representation.

Sure would be nice to have cons strings though!  (That would give O(1)
string-append.)

> b) Scheme, at least older dialects, have several O(1) guarantees.

R7RS seems to have relaxed this FWIW.  O(1) is great of course but there
are reasonable cases to be made for O(log N) being a good tradeoff if
you get other benefits.

> c) Indexing is the most important thing one wants to be fast.  For an
> utf-8 internal representation, a lot is achieved if one caches both last
> index and last byte offset, preferably also string length as index and
> byte length.

Consider threads though :/ Caches get a bit complicated there.

> d) a bad complication is write access to strings, for example with
>
>  -- Scheme Procedure: string-map! proc s [start [end]]
>  -- C Function: scm_string_map_x (proc, s, start, end)

TBH I wouldn't worry too much about this function in particular; you
could map characters into to a vector and then write those characters
back to the string.  Most modern languages of course have read-only
strings, and destructive operations on strings are mostly used when
filling buffers.

That said, this point:

> The current string character can gain a longer or a shorter byte length
> in this process.

Is especially gnarly in the threaded case; string updates are no longer
atomic.  One thread mutating a string might actually corrupt another
thread.  Right now on x86, updates are entirely atomic; on other
processors that need barriers, the worst that could happen is that
another thread could fail to read an update.  We'd have to re-add the
string write mutex, which would be a bit sad :)

> So it should provide a _large_ opportunity for the sakes of applications
> with profiles akin to Emacs or LilyPond.

I'm sympathetic :) Lots of details to get right (or wrong!) though.
WDYT about the threading aspects?

Andy

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: String internals sketch
  2017-03-10 16:08 ` Andy Wingo
@ 2017-03-10 16:51   ` David Kastrup
  2017-03-15 15:46   ` David Kastrup
  1 sibling, 0 replies; 6+ messages in thread
From: David Kastrup @ 2017-03-10 16:51 UTC (permalink / raw)
  To: Andy Wingo; +Cc: guile-user

Andy Wingo <wingo@pobox.com> writes:

> Hi :)
>
> On Fri 10 Mar 2017 16:31, David Kastrup <dak@gnu.org> writes:
>
>> a) Guile already uses two different internal representations: basically
>> UCS-8 and UCS-32.  Adding more internal representations could be done
>> using a number of tables indexed by the internal representation type,
>> making string representations sort of a "plugin".
>
> I think we probably want to avoid this if we can.  We gain a number of
> efficiencies if we can be concrete.

Well, the crucial question is where to hook the specialization for the
best compromise between performance, versatilty and effort.

> Of course there are counterexamples in which specialization can help,
> like the 20-some string kinds in V8, for example: cons strings, latin1
> strings, utf-16 strings, external strings, slices, and the product of
> all of those; but I am hesitant to take on this cost.  If we switched
> to UTF-8 strings, I would like to use it as our only string
> representation.
>
> Sure would be nice to have cons strings though!  (That would give O(1)
> string-append.)

String buffers should have O(1) string-append.

>> b) Scheme, at least older dialects, have several O(1) guarantees.
>
> R7RS seems to have relaxed this FWIW.  O(1) is great of course but there
> are reasonable cases to be made for O(log N) being a good tradeoff if
> you get other benefits.

A guaranteed O(log N) is not exactly trivial to come by for utf-8
encoded strings either.

>> c) Indexing is the most important thing one wants to be fast.  For an
>> utf-8 internal representation, a lot is achieved if one caches both
>> last index and last byte offset, preferably also string length as
>> index and byte length.
>
> Consider threads though :/ Caches get a bit complicated there.

If there is actual access of several threads on the same string, a
single-element cache is going to be mostly useless.  Don't know about
typical implementation costs and implementations of per-thread
variables, so I can't make a useful proposal here.

>> d) a bad complication is write access to strings, for example with
>>
>>  -- Scheme Procedure: string-map! proc s [start [end]]
>>  -- C Function: scm_string_map_x (proc, s, start, end)
>
> TBH I wouldn't worry too much about this function in particular;

It's certainly not an important function regarding its use frequency,
but it's a good worst case study and is nicely covered by the
gapped-string model.

> That said, this point:
>
>> The current string character can gain a longer or a shorter byte
>> length in this process.
>
> Is especially gnarly in the threaded case; string updates are no longer
> atomic.  One thread mutating a string might actually corrupt another
> thread.  Right now on x86, updates are entirely atomic; on other
> processors that need barriers, the worst that could happen is that
> another thread could fail to read an update.  We'd have to re-add the
> string write mutex, which would be a bit sad :)

A gapped string does not have a lot of contention, but of course if two
threads wanted to write in different places, there would need to be a
way to either lock the gap until the end of the current operation or to
support multiple gaps in progress.

>> So it should provide a _large_ opportunity for the sakes of
>> applications with profiles akin to Emacs or LilyPond.
>
> I'm sympathetic :) Lots of details to get right (or wrong!) though.

Sure.  Not all implementations need to get along without locks, by the
way.  I am not a fan of providing choice between several bad options
instead of getting one version right, but "right" is so large a field
here, particularly when considering embedding Guile into other systems,
that it might be a worthwhile goal to at least have the "right for any
purpose" goal met at the framework layer, and then try improving the
implementations from there.

> WDYT about the threading aspects?

Good for headaches.  But it might make sense diversifying the headaches
into the several implementations.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: String internals sketch
  2017-03-10 15:31 String internals sketch David Kastrup
  2017-03-10 16:08 ` Andy Wingo
@ 2017-03-11  0:30 ` Matt Wette
  2017-03-11  6:44   ` David Kastrup
  1 sibling, 1 reply; 6+ messages in thread
From: Matt Wette @ 2017-03-11  0:30 UTC (permalink / raw)
  To: David Kastrup; +Cc: guile-user

> On Mar 10, 2017, at 7:31 AM, David Kastrup <dak@gnu.org> wrote:
> I've been mulling a bit over enabling UTF-8 as an internal string
> representation.  There are several interesting points to consider:

One point to consider on this topic is that guile is targeted (via the compiler tower) to support other languages.   Strings in ecmascript, for example, are 16-bit code points, not 8 or 32.  I personally don’t care that the guile-2.2 hosted ecmascript can’t be a pedantic implementation, but it may generate issues so some down the road.  (BTW, the other area in which guile does not implement ecmascript is in representation of numbers

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: String internals sketch
  2017-03-11  0:30 ` Matt Wette
@ 2017-03-11  6:44   ` David Kastrup
  0 siblings, 0 replies; 6+ messages in thread
From: David Kastrup @ 2017-03-11  6:44 UTC (permalink / raw)
  To: guile-user

Matt Wette <matt.wette@gmail.com> writes:

>> On Mar 10, 2017, at 7:31 AM, David Kastrup <dak@gnu.org> wrote:
>> I've been mulling a bit over enabling UTF-8 as an internal string
>> representation.  There are several interesting points to consider:
>
> One point to consider on this topic is that guile is targeted (via the
> compiler tower) to support other languages.  Strings in ecmascript,
> for example, are 16-bit code points, not 8 or 32.  I personally don’t
> care that the guile-2.2 hosted ecmascript can’t be a pedantic
> implementation, but it may generate issues so some down the road.

Well, that's an "engine" one would need to create complete support for.
Some things would be nicer to specify via C++ templates rather than
function hooks.  C macros can sort-of be used for providing similar
functionality, but they are quite more opaque to compiler, debugger, and
programmer and more ad-hoc.

Not proposing to move to C++: that would seriously impact the ability to
integrate into other systems.  Just a musing.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: String internals sketch
  2017-03-10 16:08 ` Andy Wingo
  2017-03-10 16:51   ` David Kastrup
@ 2017-03-15 15:46   ` David Kastrup
  1 sibling, 0 replies; 6+ messages in thread
From: David Kastrup @ 2017-03-15 15:46 UTC (permalink / raw)
  To: guile-user

Andy Wingo <wingo@pobox.com> writes:

> Hi :)
>
> On Fri 10 Mar 2017 16:31, David Kastrup <dak@gnu.org> writes:
>
>> a) Guile already uses two different internal representations: basically
>> UCS-8 and UCS-32.  Adding more internal representations could be done
>> using a number of tables indexed by the internal representation type,
>> making string representations sort of a "plugin".
>
> I think we probably want to avoid this if we can.  We gain a number of
> efficiencies if we can be concrete.

Maybe use GOOPS on the string ops by diversifying on string subtypes?
One would start by making UCS-32 and UCS-8 string implementations
subtypes of strings, then add UTF-8-internal (which is identical to
UTF-8 for all-valid UTF-8)?  I doubt that this would beat a vector of
dedicated hooks but it might look more transparent to implementors with
special needs?  Maybe not: they might not approach this from a Guile
angle of thinking.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2017-03-15 15:46 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-10 15:31 String internals sketch David Kastrup
2017-03-10 16:08 ` Andy Wingo
2017-03-10 16:51   ` David Kastrup
2017-03-15 15:46   ` David Kastrup
2017-03-11  0:30 ` Matt Wette
2017-03-11  6:44   ` David Kastrup

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).