From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: Mark H Weaver Newsgroups: gmane.lisp.guile.devel Subject: O(1) accessors for UTF-8 backed strings Date: Sat, 12 Mar 2011 23:05:27 -0500 Message-ID: <87tyf7oebc.fsf_-_@netris.org> References: <486722.32491.qm@web37905.mail.mud.yahoo.com> <87aah1qoc4.fsf@netris.org> <87zkp1p84m.fsf@netris.org> <87ei6csb85.fsf@gnu.org> NNTP-Posting-Host: lo.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Trace: dough.gmane.org 1299989152 21868 80.91.229.12 (13 Mar 2011 04:05:52 GMT) X-Complaints-To: usenet@dough.gmane.org NNTP-Posting-Date: Sun, 13 Mar 2011 04:05:52 +0000 (UTC) Cc: guile-devel@gnu.org To: ludo@gnu.org (Ludovic =?utf-8?Q?Court=C3=A8s?=) Original-X-From: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Sun Mar 13 05:05:48 2011 Return-path: Envelope-to: guile-devel@m.gmane.org Original-Received: from lists.gnu.org ([199.232.76.165]) by lo.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1PycYt-00011O-Bi for guile-devel@m.gmane.org; Sun, 13 Mar 2011 05:05:47 +0100 Original-Received: from localhost ([127.0.0.1]:52055 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1PycYs-0000TM-2I for guile-devel@m.gmane.org; Sat, 12 Mar 2011 23:05:46 -0500 Original-Received: from [140.186.70.92] (port=45862 helo=eggs.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1PycYo-0000Qf-AF for guile-devel@gnu.org; Sat, 12 Mar 2011 23:05:43 -0500 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1PycYn-00080Y-6C for guile-devel@gnu.org; Sat, 12 Mar 2011 23:05:42 -0500 Original-Received: from world.peace.net ([216.204.32.208]:50547) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1PycYl-00080E-Mr; Sat, 12 Mar 2011 23:05:39 -0500 Original-Received: from c-98-217-70-13.hsd1.ma.comcast.net ([98.217.70.13] helo=freedomincluded) by world.peace.net with esmtpa (Exim 4.69) (envelope-from ) id 1PycYg-0007kA-7v; Sat, 12 Mar 2011 23:05:34 -0500 Original-Received: from mhw by freedomincluded with local (Exim 4.69) (envelope-from ) id 1PycYe-0000QC-V9; Sat, 12 Mar 2011 23:05:32 -0500 In-Reply-To: <87ei6csb85.fsf@gnu.org> ("Ludovic =?utf-8?Q?Court=C3=A8s=22'?= =?utf-8?Q?s?= message of "Sat, 12 Mar 2011 14:46:18 +0100") User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (gnu/linux) X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6 (newer, 3) X-Received-From: 216.204.32.208 X-BeenThere: guile-devel@gnu.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: "Developers list for Guile, the GNU extensibility library" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Original-Sender: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Errors-To: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Xref: news.gmane.org gmane.lisp.guile.devel:11861 Archived-At: ludo@gnu.org (Ludovic Court=C3=A8s) writes: > I also think strings should remain what they currently are, with O(1) > random access. I just realized that it is possible to implement O(1) accessors for UTF-8 backed strings. The simplest solution is to break the stringbuf into chunks, where each chunk holds exactly CHUNK_SIZE characters, except for the last chunk which might have fewer. To find a character at index i, we scan forward (i % CHUNK_SIZE) characters from the beginning of chunk[i / CHUNK_SIZE]. Ideally, CHUNK_SIZE is a power of 2, which allows us to use bitwise operations instead of division. The number of characters to scan is bounded by CHUNK_SIZE, and therefore takes O(1) time. String-set! in general requires us to resize a chunk. Although chunks contain a constant number of characters, that of course maps to a variable number of bytes. There are various tricks we can do to reduce the number of reallocations done, but even in the worse case, the time spent is bounded by CHUNK_SIZE, and thus O(1). Converting a string to UTF-8 now consists of concatenating its chunks. For short strings (<=3D CHUNK_SIZE) we can simply return the address of the first chunk. In the general case, we'll have to recombine the chunks into a new block, which of course takes O(n) time. With a little added complexity, we can reduce this cost in common cases, and even eliminate it completely when string-set! is not used. The idea is to initially store a stringbuf in a single piece. Only after the first string-set! would the stringbuf be broken up into chunks. A flag in the stringbuf would indicate whether or not it is chunked. In a chunked string, each chunk[i] points to its own memory block. In an unchunked string, the entire string is in one block of UTF-8, pointed to by chunk[0]. chunk[i] for i>0 points into the middle of the block, at an offset of i*CHUNK_SIZE characters. This allows string-ref to treat chunked and unchunked stringbufs equivalently. It is probably worthwhile to actually _convert_ a stringbuf back into unchunked form whenever we convert a string to utf8. That way, in the common case where string-set! is only used to initialize the string, later conversions to utf8 can be done in constant time (except the first one, which can be charged to the initialization routine). Now, it is my hope that string-set! will be seldom used, and that we can encourage people to instead use string ports and/or string-unfold and string-unfold-right. These other functions can be implemented without string-set!, and will not require the use of the chunked representation. Therefore, it is my hope that the chunked representation can be avoided altogether in most all cases, which enables fast constant-time conversions to UTF-8, and thus the efficient use of libunistring and any other library that understands UTF-8. What do you think? Mark