O(1) accessors for UTF-8 backed strings

unofficial mirror of guile-devel@gnu.org 
 help / color / mirror / Atom feed

From: Mark H Weaver <mhw@netris.org>
To: ludo@gnu.org (Ludovic Courtès)
Cc: guile-devel@gnu.org
Subject: O(1) accessors for UTF-8 backed strings
Date: Sat, 12 Mar 2011 23:05:27 -0500	[thread overview]
Message-ID: <87tyf7oebc.fsf_-_@netris.org> (raw)
In-Reply-To: <87ei6csb85.fsf@gnu.org> ("Ludovic Courtès"'s message of "Sat, 12 Mar 2011 14:46:18 +0100")

ludo@gnu.org (Ludovic Courtès) writes:
> I also think strings should remain what they currently are, with O(1)
> random access.

I just realized that it is possible to implement O(1) accessors for
UTF-8 backed strings.

The simplest solution is to break the stringbuf into chunks, where each
chunk holds exactly CHUNK_SIZE characters, except for the last chunk
which might have fewer.  To find a character at index i, we scan forward
(i % CHUNK_SIZE) characters from the beginning of chunk[i / CHUNK_SIZE].
Ideally, CHUNK_SIZE is a power of 2, which allows us to use bitwise
operations instead of division.  The number of characters to scan is
bounded by CHUNK_SIZE, and therefore takes O(1) time.

String-set! in general requires us to resize a chunk.  Although chunks
contain a constant number of characters, that of course maps to a
variable number of bytes.  There are various tricks we can do to reduce
the number of reallocations done, but even in the worse case, the time
spent is bounded by CHUNK_SIZE, and thus O(1).

Converting a string to UTF-8 now consists of concatenating its chunks.
For short strings (<= CHUNK_SIZE) we can simply return the address of
the first chunk.  In the general case, we'll have to recombine the
chunks into a new block, which of course takes O(n) time.

With a little added complexity, we can reduce this cost in common cases,
and even eliminate it completely when string-set! is not used.  The idea
is to initially store a stringbuf in a single piece.  Only after the
first string-set! would the stringbuf be broken up into chunks.

A flag in the stringbuf would indicate whether or not it is chunked.  In
a chunked string, each chunk[i] points to its own memory block.  In an
unchunked string, the entire string is in one block of UTF-8, pointed to
by chunk[0].  chunk[i] for i>0 points into the middle of the block, at
an offset of i*CHUNK_SIZE characters.  This allows string-ref to treat
chunked and unchunked stringbufs equivalently.

It is probably worthwhile to actually _convert_ a stringbuf back into
unchunked form whenever we convert a string to utf8.  That way, in the
common case where string-set! is only used to initialize the string,
later conversions to utf8 can be done in constant time (except the first
one, which can be charged to the initialization routine).

Now, it is my hope that string-set! will be seldom used, and that we can
encourage people to instead use string ports and/or string-unfold and
string-unfold-right.  These other functions can be implemented without
string-set!, and will not require the use of the chunked representation.

Therefore, it is my hope that the chunked representation can be avoided
altogether in most all cases, which enables fast constant-time
conversions to UTF-8, and thus the efficient use of libunistring and any
other library that understands UTF-8.

What do you think?

      Mark

next prev parent reply	other threads:[~2011-03-13  4:05 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-03-11  0:54 uc_tolower (uc_toupper (x)) Mike Gran
2011-03-11 22:33 ` Using libunistring for string comparisons et al Mark H Weaver
2011-03-11 22:36   ` Mark H Weaver
2011-03-11 23:09   ` Mark H Weaver
2011-03-12 13:46     ` Ludovic Courtès
2011-03-12 17:28       ` Mark H Weaver
2011-03-13 21:30         ` Ludovic Courtès
2011-03-30  9:05           ` Andy Wingo
2011-03-13  4:05       ` Mark H Weaver [this message]
2011-03-13  4:42         ` O(1) accessors for UTF-8 backed strings Alex Shinn
2011-03-15 15:46           ` Mark H Weaver
2011-03-16  0:07             ` Alex Shinn
2011-03-19 13:02             ` Andy Wingo
2011-03-30  9:20         ` Andy Wingo
2011-03-30  9:03     ` Using libunistring for string comparisons et al Andy Wingo
2011-03-31 14:19       ` Ludovic Courtès
2011-03-12 13:36   ` Ludovic Courtès

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/guile/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87tyf7oebc.fsf_-_@netris.org \
    --to=mhw@netris.org \
    --cc=guile-devel@gnu.org \
    --cc=ludo@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).