From: David Kastrup <dak@gnu.org>
To: guile-user@gnu.org
Subject: String internals sketch
Date: Fri, 10 Mar 2017 16:31:38 +0100 [thread overview]
Message-ID: <87efy52lnp.fsf@fencepost.gnu.org> (raw)
Hi,
I've been mulling a bit over enabling UTF-8 as an internal string
representation. There are several interesting points to consider:
a) Guile already uses two different internal representations: basically
UCS-8 and UCS-32. Adding more internal representations could be done
using a number of tables indexed by the internal representation type,
making string representations sort of a "plugin".
b) Scheme, at least older dialects, have several O(1) guarantees. If
the chosen internal representations can be configured with some fluid
(like it is done with port encodings), possibly as a list in priority
order, one can retain an O(1) guarantee as the default behavior while
allowing different mechanisms for different requirements.
c) Indexing is the most important thing one wants to be fast. For an
utf-8 internal representation, a lot is achieved if one caches both last
index and last byte offset, preferably also string length as index and
byte length. For each indexing operation, one can then first check the
last string-middle cache, and then scan forwards or backwards to do the
indexing (if string start or string end are closer, one can scan from
there), assuming reasonably sanitized behavior. That will make
sequential processing close to O(1) behavior.
d) a bad complication is write access to strings, for example with
-- Scheme Procedure: string-map! proc s [start [end]]
-- C Function: scm_string_map_x (proc, s, start, end)
PROC is a char->char procedure, it is mapped over S. The order in
which the procedure is applied to the string elements is not
specified. The string S is modified in-place, the return value is
not specified.
The current string character can gain a longer or a shorter byte length
in this process. This requires being able to deal with
insertion/deletion of bytes at the current position. One way to do that
is to have a _gap_ for the purpose of string-map! and similar functions
and copy material from the end of the gap to its start while iterating.
Now this starts looking so much like the memory management Emacs buffers
that it isn't funny. It also is _the_ natural string representation to
use for things like with-output-to-string-buffer and
with-input-to-string-buffer.
Basically, one valid string representation would coincide with an utf-8
based string buffer.
This would seem like it would also give a boost to some Guile operations
and would offer possibilities for making "string representation plugins"
for special purposes. This should offer a lot of leeway to gradually
suck up the best parts of things like Emacs' string representation and
code conversion facilities into a Guile kernel without being tied down
by the current Guile facilities and without having stuff like Emacs
strings black boxes mostly inaccessible to Guile programming.
It should also allow to bounce stuff like sanitized utf-8 through Guile
without incurring constant conversion costs.
So it should provide a _large_ opportunity for the sakes of applications
with profiles akin to Emacs or LilyPond.
--
David Kastrup
next reply other threads:[~2017-03-10 15:31 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-03-10 15:31 David Kastrup [this message]
2017-03-10 16:08 ` String internals sketch Andy Wingo
2017-03-10 16:51 ` David Kastrup
2017-03-15 15:46 ` David Kastrup
2017-03-11 0:30 ` Matt Wette
2017-03-11 6:44 ` David Kastrup
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: https://www.gnu.org/software/guile/
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87efy52lnp.fsf@fencepost.gnu.org \
--to=dak@gnu.org \
--cc=guile-user@gnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).