From: Clinton Ebadi
Newsgroups: gmane.lisp.guile.devel
Subject: Re: Wide strings
Date: Wed, 28 Jan 2009 15:44:25 -0500
Message-ID: <87wscftb2u.fsf@unknownlamer.org>
References: <470889.75847.qm@web37904.mail.mud.yahoo.com>
 <87wscjvwyq.fsf@gnu.org> <437818.2998.qm@web37907.mail.mud.yahoo.com>
 <87pri9lpab.fsf@gnu.org> <142660.24551.qm@web37906.mail.mud.yahoo.com>
 <87ljswk21l.fsf@gnu.org> <591698.58378.qm@web37905.mail.mud.yahoo.com>
In-Reply-To: <591698.58378.qm@web37905.mail.mud.yahoo.com> (Mike Gran's
 message of "Wed, 28 Jan 2009 08:44:15 -0800 (PST)")
To: guile-devel@gnu.org
List-Id: "Developers list for Guile, the GNU extensibility library"

Mike Gran writes:

> Hi,
>
> Let's say that one possible goal is to add wide strings
> * using Gnulib functions
> * with minimal changes to the public Guile API
> * where chars become 4-byte codepoints and strings are internally
>   either UTF-32 or ISO-8859-1
>
> Since I need this functionality taken care of, and since I have some
> time to play with it, what's the procedure here? Should I mock
> something up and submit it as a patch? If I did, it would likely be
> a big patch. Do we need to talk more about what needs to be
> accomplished? Do we need a complete specification? Do we need
> a vote on if it is a good idea?

You should take a look at Common Lisp strings[0] and streams[1]. The
gist is that a string is a uniform array of some subtype of
`character'[2], and character streams have an :external-format:
character data is converted to/from that format when writing/reading
the stream.
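To make the boundary-conversion idea above concrete, here is a minimal
C sketch. It is neither Guile nor Common Lisp code: it only assumes that
strings are held internally as arrays of code points and that an
external format is applied when bytes are read or written. All names
(`ext_format', `decode_external', `encode_external', `CONV_ERROR') are
hypothetical, and only Latin-1 and UTF-8 are handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; not Guile or Common Lisp API. */
enum ext_format { EXT_LATIN1, EXT_UTF8 };

#define CONV_ERROR ((size_t)-1)

/* Read side: external bytes -> internal code points.
   Returns the number of code points stored, or CONV_ERROR on malformed
   input or when `out' is too small. */
size_t decode_external(enum ext_format fmt, const unsigned char *in,
                       size_t n, uint32_t *out, size_t cap)
{
    size_t i = 0, count = 0;
    assert(out != NULL || cap == 0);
    while (i < n) {
        uint32_t cp;
        if (fmt == EXT_LATIN1 || in[i] < 0x80) {
            cp = in[i++];                    /* one byte, one character */
        } else {
            /* UTF-8 lead byte: find how many continuation bytes follow
               and the smallest code point that length may encode. */
            size_t extra;
            uint32_t lowest;
            if ((in[i] & 0xE0) == 0xC0)      { extra = 1; lowest = 0x80;    cp = in[i] & 0x1F; }
            else if ((in[i] & 0xF0) == 0xE0) { extra = 2; lowest = 0x800;   cp = in[i] & 0x0F; }
            else if ((in[i] & 0xF8) == 0xF0) { extra = 3; lowest = 0x10000; cp = in[i] & 0x07; }
            else return CONV_ERROR;          /* stray continuation or bad lead byte */
            if (n - i <= extra) return CONV_ERROR;   /* truncated sequence */
            for (size_t k = 1; k <= extra; k++) {
                if ((in[i + k] & 0xC0) != 0x80) return CONV_ERROR;
                cp = (cp << 6) | (in[i + k] & 0x3F);
            }
            /* Reject overlong forms, out-of-range values and surrogates. */
            if (cp < lowest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
                return CONV_ERROR;
            i += extra + 1;
        }
        if (count == cap) return CONV_ERROR;
        out[count++] = cp;
    }
    return count;
}

/* Write side: internal code points -> external bytes.
   Returns the number of bytes stored, or CONV_ERROR if a character is
   not representable in `fmt' or `out' is too small. */
size_t encode_external(enum ext_format fmt, const uint32_t *in, size_t n,
                       unsigned char *out, size_t cap)
{
    static const unsigned char lead[] = { 0, 0, 0xC0, 0xE0, 0xF0 };
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = in[i];
        size_t need;
        if (fmt == EXT_LATIN1) {
            if (cp > 0xFF) return CONV_ERROR;
            need = 1;
        } else if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return CONV_ERROR;
        } else {
            need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        }
        if (cap - o < need) return CONV_ERROR;
        if (need == 1) {
            out[o++] = (unsigned char)cp;
        } else {
            /* Lead byte holds the top bits; each continuation byte holds 6 more. */
            out[o++] = (unsigned char)(lead[need] | (cp >> (6 * (need - 1))));
            for (size_t k = need - 1; k-- > 0; )
                out[o++] = (unsigned char)(0x80 | ((cp >> (6 * k)) & 0x3F));
        }
    }
    return o;
}
```

Decoding the four Latin-1 bytes of "café" yields four code points;
encoding those as UTF-8 yields five bytes, since the last character
needs two. The internal array is the same either way, and only the two
boundary functions know about encodings.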
Guile should be a bit more opaque and just specify the string as being
an ordered sequence of characters, and provide conversion functions
to/from uniform byte[^0] arrays in some explicitly specified encoding.

The `scm_{to|from}_locale_string' functions provide enough abstraction
to make this doable without breaking anything that doesn't use
`scm_take_locale_string' (and even then Guile can detect when the
locale is not UCS-4, revert to `scm_from_locale_string' and `free' the
taken string immediately after conversion). This could be enhanced
with `scm_{to|from}_encoded_string ({char*|SCM} string, enum encoding)'
functions.

> Pragmatically, I see that this can be broken up into three steps.
> (Not for public use. Just as programming subtasks.)
>
> 1. Convert the internal char and string representation to be
> explicitly ISO 8859-1. Add the to/from locale conversion functionality
> while still retaining 8-bit strings. Replace C library funcs with
> Gnulib string funcs where appropriate.

Initially, I would suggest just using UCS-4 internally and iconv[3] to
handle conversion to/from the locale-dependent encodings for C.

Converting to an external encoding within `scm_to_{}_string' has
minimal overhead really: the stringbuf has to be copied anyway
(likewise for `scm_from_{}_string'). If you are writing the externally
encoded string to a stream it is even cheaper, as no memory need be
allocated during conversion.

I think it is acceptable to restrict the encoding of the string passed
to `scm_take_string'. If you are constructing strings that Guile can
take possession of, you probably have a bit of control over the
encoding; if you don't, generating a string and throwing it away more
or less immediately is still pretty cheap if malloc doesn't suck.

Adding a `scm_take_encoded_string' and removing the guarantee from
`scm_take_locale_string' that Guile will not copy the string seems to
be all that is needed to make taking strings work more or less
transparently.

> 2. Convert the internal representation of chars to 4-byte
> codepoints, while still retaining 8-bit strings.
>
> 3. Convert strings to be a union of 1 byte and 4 byte chars.

After getting a basic implementation done using a fixed-width internal
encoding, it seems better to make the internal encoding flexible
rather than doing something like this.

Basically `make-string' would be extended with an :internal-encoding
argument, or a new `make-string-with-internal-encoding' (with a better
name perhaps) introduced to explicitly specify the internal encoding
the application desires.

An encoding would be implemented as a protocol of some sort that
implemented a few primitive operations: conversion to UCS-4[^1],
length, substring, concatenate, indexed ref, and indexed set! seem to
be the minimal set for an optimizable implementation. Indices would
have an unspecified type to allow for fancy internal encodings, e.g. a
tree of some sort of UTF-8 codepoints that supported fast substring
and concatenation.

Allowing an internal encoding to not implement a destructive set!
opens up some interesting optimizations for purely functional strings
(e.g. for representing things like Emacs buffers using fancy
persistent trees that are efficiently updateable and can maintain an
edit history with nearly nil overhead).

Does this seem reasonable?

[0] http://www.lispworks.com/documentation/HyperSpec/Body/16_a.htm
[1] http://www.lispworks.com/documentation/HyperSpec/Body/21_a.htm
[2] http://www.lispworks.com/documentation/HyperSpec/Body/13_a.htm
[3] http://www.gnu.org/software/libiconv/
[4] http://www.lispworks.com/documentation/HyperSpec/Body/f_by_by.htm
[5] http://www.lispworks.com/documentation/HyperSpec/Body/f_ldb.htm
[6] http://www.lispworks.com/documentation/HyperSpec/Body/f_dpb.htm#dpb

[^0] A `byte'[4] in CL is an arbitrary-width sequence of bits; e.g. a
     /traditional/ byte is described by the byte specifier `(byte 8 0)'
     (8 bits starting at bit 0) and a 32-bit machine word by
     `(byte 32 0)'.
     Unrelatedly, you can do some neat things using these
     arbitrary-width bytes with `ldb'[5]/`dpb'[6].

[^1] Minimally; ideally an internal encoding would be passed any format
     iconv understands and, if possible, convert directly to that; if
     not, it would use UCS-4 and punt to the default conversion
     function instead.

-- 
emacsen: "Like... windows are portals man...
emacsen: Dude... let's yank this shit out of the kill ring"
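The encoding-protocol idea from the message can be sketched in C as a
table of primitive operations per internal encoding. This is only an
illustration under stated assumptions, not Guile code: every name here
(`struct encoding', `struct str', `str_ref', `str_set', `str_to_ucs4',
and the three encoding tables) is hypothetical, indices are simplified
to `size_t' rather than an opaque type, and substring and concatenate
are left out to keep it short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct str;

/* Hypothetical protocol: one table of primitives per internal encoding. */
struct encoding {
    const char *name;
    size_t   (*length)(const struct str *s);
    uint32_t (*ref)(const struct str *s, size_t i);
    int      (*set)(struct str *s, size_t i, uint32_t cp); /* NULL: no set! */
};

struct str {
    const struct encoding *enc;
    void *data;      /* storage, interpreted by `enc' */
    size_t nunits;   /* number of storage units */
};

/* Both encodings below are fixed width, so length is the unit count. */
static size_t fixed_length(const struct str *s) { return s->nunits; }

static uint32_t latin1_ref(const struct str *s, size_t i)
{
    return ((const uint8_t *)s->data)[i];
}

static int latin1_set(struct str *s, size_t i, uint32_t cp)
{
    if (cp > 0xFF) return -1;            /* not representable in Latin-1 */
    ((uint8_t *)s->data)[i] = (uint8_t)cp;
    return 0;
}

static uint32_t ucs4_ref(const struct str *s, size_t i)
{
    return ((const uint32_t *)s->data)[i];
}

static int ucs4_set(struct str *s, size_t i, uint32_t cp)
{
    if (cp > 0x10FFFF) return -1;        /* outside the code point range */
    ((uint32_t *)s->data)[i] = cp;
    return 0;
}

const struct encoding latin1_encoding = { "ISO-8859-1", fixed_length, latin1_ref, latin1_set };
const struct encoding ucs4_encoding   = { "UCS-4", fixed_length, ucs4_ref, ucs4_set };
/* An encoding may decline to implement destructive update. */
const struct encoding frozen_ucs4_encoding = { "UCS-4 (immutable)", fixed_length, ucs4_ref, NULL };

/* Generic entry points: callers never look at the representation. */
size_t str_length(const struct str *s) { return s->enc->length(s); }

uint32_t str_ref(const struct str *s, size_t i)
{
    assert(i < str_length(s));
    return s->enc->ref(s, i);
}

/* Returns 0 on success, -1 if the encoding is immutable, the index is
   out of range, or the character cannot be stored. */
int str_set(struct str *s, size_t i, uint32_t cp)
{
    if (s->enc->set == NULL || i >= str_length(s)) return -1;
    return s->enc->set(s, i, cp);
}

/* Default conversion to UCS-4, built only from length and ref.
   Returns the number of code points, or (size_t)-1 if `out' is too small. */
size_t str_to_ucs4(const struct str *s, uint32_t *out, size_t cap)
{
    size_t n = str_length(s);
    if (n > cap) return (size_t)-1;
    for (size_t i = 0; i < n; i++) out[i] = str_ref(s, i);
    return n;
}
```

With this shape, a Latin-1 string rejects a `str_set' of U+20AC while a
UCS-4 string accepts it, and a string using the immutable table rejects
every `str_set'. A variable-width or tree-based encoding would supply
its own `length' and `ref' and would be where an opaque index type
starts to matter.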