From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Mark H Weaver <mhw@netris.org>
Newsgroups: gmane.lisp.guile.devel
Subject: O(1) accessors for UTF-8 backed strings
Date: Sat, 12 Mar 2011 23:05:27 -0500
Message-ID: <87tyf7oebc.fsf_-_@netris.org>
References: <486722.32491.qm@web37905.mail.mud.yahoo.com>
	<87aah1qoc4.fsf@netris.org> <87zkp1p84m.fsf@netris.org>
	<87ei6csb85.fsf@gnu.org>
NNTP-Posting-Host: lo.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Trace: dough.gmane.org 1299989152 21868 80.91.229.12 (13 Mar 2011 04:05:52 GMT)
X-Complaints-To: usenet@dough.gmane.org
NNTP-Posting-Date: Sun, 13 Mar 2011 04:05:52 +0000 (UTC)
Cc: guile-devel@gnu.org
To: ludo@gnu.org (Ludovic =?utf-8?Q?Court=C3=A8s?=)
Original-X-From: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Sun Mar 13 05:05:48 2011
Return-path: <guile-devel-bounces+guile-devel=m.gmane.org@gnu.org>
Envelope-to: guile-devel@m.gmane.org
Original-Received: from lists.gnu.org ([199.232.76.165])
	by lo.gmane.org with esmtp (Exim 4.69)
	(envelope-from <guile-devel-bounces+guile-devel=m.gmane.org@gnu.org>)
	id 1PycYt-00011O-Bi
	for guile-devel@m.gmane.org; Sun, 13 Mar 2011 05:05:47 +0100
Original-Received: from localhost ([127.0.0.1]:52055 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1PycYs-0000TM-2I
	for guile-devel@m.gmane.org; Sat, 12 Mar 2011 23:05:46 -0500
Original-Received: from [140.186.70.92] (port=45862 helo=eggs.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1PycYo-0000Qf-AF
	for guile-devel@gnu.org; Sat, 12 Mar 2011 23:05:43 -0500
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <mhw@netris.org>) id 1PycYn-00080Y-6C
	for guile-devel@gnu.org; Sat, 12 Mar 2011 23:05:42 -0500
Original-Received: from world.peace.net ([216.204.32.208]:50547)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <mhw@netris.org>)
	id 1PycYl-00080E-Mr; Sat, 12 Mar 2011 23:05:39 -0500
Original-Received: from c-98-217-70-13.hsd1.ma.comcast.net ([98.217.70.13]
	helo=freedomincluded) by world.peace.net with esmtpa (Exim 4.69)
	(envelope-from <mhw@netris.org>)
	id 1PycYg-0007kA-7v; Sat, 12 Mar 2011 23:05:34 -0500
Original-Received: from mhw by freedomincluded with local (Exim 4.69)
	(envelope-from <mhw@netris.org>)
	id 1PycYe-0000QC-V9; Sat, 12 Mar 2011 23:05:32 -0500
In-Reply-To: <87ei6csb85.fsf@gnu.org> ("Ludovic =?utf-8?Q?Court=C3=A8s=22'?=
	=?utf-8?Q?s?= message of "Sat, 12 Mar 2011 14:46:18 +0100")
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (gnu/linux)
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6 (newer, 3)
X-Received-From: 216.204.32.208
X-BeenThere: guile-devel@gnu.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "Developers list for Guile,
	the GNU extensibility library" <guile-devel.gnu.org>
List-Unsubscribe: <http://lists.gnu.org/mailman/listinfo/guile-devel>,
	<mailto:guile-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/guile-devel>
List-Post: <mailto:guile-devel@gnu.org>
List-Help: <mailto:guile-devel-request@gnu.org?subject=help>
List-Subscribe: <http://lists.gnu.org/mailman/listinfo/guile-devel>,
	<mailto:guile-devel-request@gnu.org?subject=subscribe>
Original-Sender: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org
Errors-To: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.lisp.guile.devel:11861
Archived-At: <http://permalink.gmane.org/gmane.lisp.guile.devel/11861>

ludo@gnu.org (Ludovic Court=C3=A8s) writes:
> I also think strings should remain what they currently are, with O(1)
> random access.

I just realized that it is possible to implement O(1) accessors for
UTF-8 backed strings.

The simplest solution is to break the stringbuf into chunks, where each
chunk holds exactly CHUNK_SIZE characters, except for the last chunk
which might have fewer.  To find a character at index i, we scan forward
(i % CHUNK_SIZE) characters from the beginning of chunk[i / CHUNK_SIZE].
Ideally, CHUNK_SIZE is a power of 2, which allows us to use bitwise
operations instead of division.  The number of characters to scan is
bounded by CHUNK_SIZE, and therefore takes O(1) time.

String-set! in general requires us to resize a chunk.  Although chunks
contain a constant number of characters, that of course maps to a
variable number of bytes.  There are various tricks we can do to reduce
the number of reallocations done, but even in the worse case, the time
spent is bounded by CHUNK_SIZE, and thus O(1).

Converting a string to UTF-8 now consists of concatenating its chunks.
For short strings (<=3D CHUNK_SIZE) we can simply return the address of
the first chunk.  In the general case, we'll have to recombine the
chunks into a new block, which of course takes O(n) time.

With a little added complexity, we can reduce this cost in common cases,
and even eliminate it completely when string-set! is not used.  The idea
is to initially store a stringbuf in a single piece.  Only after the
first string-set! would the stringbuf be broken up into chunks.

A flag in the stringbuf would indicate whether or not it is chunked.  In
a chunked string, each chunk[i] points to its own memory block.  In an
unchunked string, the entire string is in one block of UTF-8, pointed to
by chunk[0].  chunk[i] for i>0 points into the middle of the block, at
an offset of i*CHUNK_SIZE characters.  This allows string-ref to treat
chunked and unchunked stringbufs equivalently.

It is probably worthwhile to actually _convert_ a stringbuf back into
unchunked form whenever we convert a string to utf8.  That way, in the
common case where string-set! is only used to initialize the string,
later conversions to utf8 can be done in constant time (except the first
one, which can be charged to the initialization routine).

Now, it is my hope that string-set! will be seldom used, and that we can
encourage people to instead use string ports and/or string-unfold and
string-unfold-right.  These other functions can be implemented without
string-set!, and will not require the use of the chunked representation.

Therefore, it is my hope that the chunked representation can be avoided
altogether in most all cases, which enables fast constant-time
conversions to UTF-8, and thus the efficient use of libunistring and any
other library that understands UTF-8.

What do you think?

      Mark